After reading recent papers over the past few days, I found three main families of parallel optimization schemes for the Cell processor: optimizing the DMA transfer (single buffering, double buffering and multi-buffering); optimizing the implementation model (how the whole data block is divided into pieces, how the program flow is split across the SPEs and how the program structures are connected, since different computation models give different results); and optimizing the get/put methods (prefetch-style mechanisms such as pseudo-results, the asynchronous unsafe cache and the software cache).
Our main optimization method is a waiting mechanism (a data-waiting technique), which belongs to the third family, “optimize the get/put methods”. The aim is to develop a general, reusable way to decide what to cache, when to cache it, how to cache it and who carries out the operations. The following paragraphs describe what we need to do.
Seen in general terms, program = data + algorithm, as we learned in the programming course. (In the operating systems course we also learned that a running program has a process control block, which we do not discuss here.) A running program therefore handles a large amount of data, while different algorithms produce different execution paths and program structures, which are expressed as functions. This answers “what we need to cache”: array data and function params/results.
Arrays are the main data type in image processing algorithms, because a digital image is usually stored in an array, or can be converted to one from its width, height and RGB values. They hold a large amount of information that is computed on or modified while the program runs. In a parallel architecture the array data is handed out to different PEs and combined again afterwards, so a lot of data is transferred, which increases the waiting time of the whole execution. How to speed up the transfer itself is not our concern here; that belongs to “optimize the DMA transfer”. What we want is to reduce the amount of data transferred by means of the waiting mechanism. The second thing we concentrate on is function params (pointer parameters in particular) and results. The result of a function is often used more than once during processing, especially in image processing; the Fibonacci sequence is a simple example. If we can find a good way to cache the result and reuse it when needed, we avoid repeated computation, which should speed up execution considerably.
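The Fibonacci example can be sketched as a small function result cache. This is only an illustration in portable C, not Cell-specific code: the names fib_cached, fib_cache and fib_evaluations are made up for this note, and the table is a plain static array rather than anything placed in an SPE local store.

```c
#include <assert.h>
#include <stdint.h>

#define FIB_MAX 93  /* fib(93) is the largest Fibonacci number that fits in uint64_t */

/* Hypothetical result cache: one slot per possible argument. */
static uint64_t fib_cache[FIB_MAX + 1];
static int fib_known[FIB_MAX + 1];

/* Counts how often the function body really computes a value (cache misses). */
static unsigned long fib_evaluations;

uint64_t fib_cached(int n)
{
    assert(n >= 0 && n <= FIB_MAX);

    if (fib_known[n]) {
        return fib_cache[n];            /* reuse the stored result */
    }

    fib_evaluations++;
    uint64_t result = (n < 2) ? (uint64_t)n
                              : fib_cached(n - 1) + fib_cached(n - 2);

    fib_cache[n] = result;              /* store the result for later calls */
    fib_known[n] = 1;
    return result;
}
```

With the cache, computing fib_cached(10) evaluates each argument from 0 to 10 exactly once, and a second call with the same argument does no computation at all. Without it, the naive recursion recomputes the same values many times.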
The first question is “when” to cache. Once an array has been loaded and one processing step on it has finished, and the data is about to go on to the next step, we should cache it. This avoids the pointless delay of writing it back to memory and reading it again to continue. It also means grouping operations that read and write the same data consecutively, so that they run together. Sometimes we can simply keep the data in registers if there is enough space. We may therefore have to change the way the program executes, depending on the implementation model. For function params/results, caching every call and its result would in theory be best for reuse, but the storage space may not be enough. So I think we also need a replacement mechanism that decides how entries in the cache are switched and replaced.
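One possible shape for such a replacement mechanism is a small fixed-size cache that evicts the least recently used entry. The sketch below is an assumption made for illustration, not the design proposed in this note: the slot count, the names cached_call and expensive_function, and the choice of LRU itself are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4   /* deliberately tiny so that replacement is easy to observe */

struct slot {
    int valid;
    int key;                    /* the function argument */
    long value;                 /* the cached result */
    unsigned long last_used;    /* 0 for an empty slot, otherwise the tick of the last access */
};

static struct slot slots[CACHE_SLOTS];
static unsigned long tick;
static unsigned long misses;

/* Stand-in for a costly computation. */
static long expensive_function(int x)
{
    return (long)x * x + 1;
}

long cached_call(int key)
{
    size_t victim = 0;

    tick++;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (slots[i].valid && slots[i].key == key) {
            slots[i].last_used = tick;      /* hit: refresh the entry */
            return slots[i].value;
        }
        /* Empty slots have last_used == 0, so they are chosen before any valid slot. */
        if (slots[i].last_used < slots[victim].last_used) {
            victim = i;
        }
    }

    /* Miss: compute the result and overwrite the least recently used slot. */
    misses++;
    slots[victim].valid = 1;
    slots[victim].key = key;
    slots[victim].value = expensive_function(key);
    slots[victim].last_used = tick;
    return slots[victim].value;
}
```

Calling cached_call with keys 1, 2, 3, 4, then 1 again, then 5 evicts key 2, because key 1 was refreshed by its second access. A real design would have to weigh this policy against the limited local store of an SPE.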
Now we come to the most important part: “how” to cache. For array data, each SPE has 128 registers (128 bits each), which is enough for ordinary array processing. We only need to combine the read and the write so as to avoid useless back-and-forth transfers. Advanced techniques such as SIMD, intrinsics and vector programming will be used. Function params/results are harder to handle. For a function that returns a result (one containing “return x;”), we should allocate a dedicated storage slot to cache it. But for a void function, which returns nothing, how do we know what needs to be cached? We need to cache the parameters that the function modifies, which are passed as pointers. Since a pointer already carries the address of the data, it should be enough to store that address.
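To make “combine the read and write together” concrete, here is a minimal sketch that applies two image processing steps either as two passes over the array or as one fused pass. It is portable scalar C written for this note; the functions brighten, threshold, two_passes and fused_pass are hypothetical, and an SPE version would presumably use SIMD intrinsics instead of the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: add 40 to a pixel, saturating at 255. */
static uint8_t brighten(uint8_t p)
{
    unsigned v = p + 40u;
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Step 2: binarize a pixel around 128. */
static uint8_t threshold(uint8_t p)
{
    return p >= 128 ? 255 : 0;
}

/* Unfused: every pixel is read and written twice. */
void two_passes(uint8_t *px, size_t n)
{
    assert(px != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        px[i] = brighten(px[i]);
    }
    for (size_t i = 0; i < n; i++) {
        px[i] = threshold(px[i]);
    }
}

/* Fused: the intermediate value stays in a register, so each pixel
   is read once and written once. */
void fused_pass(uint8_t *px, size_t n)
{
    assert(px != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        px[i] = threshold(brighten(px[i]));
    }
}
```

Both versions produce the same output. The fused one halves the number of loads and stores, which is the kind of saving the waiting mechanism is after when the array would otherwise travel between local store and main memory between steps.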
From what I have read and learned over the past few days, I think the software cache will be the main scheme. What I need to do is design a new header file, “mylib.h”, built on the software cache, DMA transfers and other existing library APIs. A program that includes mylib.h and adds a few designed statements can then choose whether to cache and reuse array data and function params/results.
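As a rough idea of what a software cache does, the sketch below implements a direct-mapped read cache in portable C. It is not the Cell SDK software cache and not the planned mylib.h interface: the names cached_read, cache_line and line_fetches are hypothetical, memcpy stands in for a DMA get, and the code assumes a single source array whose length is a multiple of LINE_WORDS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_WORDS 16   /* ints per cache line */
#define NUM_LINES  8    /* lines held in the (simulated) local store */

struct cache_line {
    int valid;
    size_t tag;                 /* which block of main memory the line holds */
    int data[LINE_WORDS];
};

static struct cache_line lines[NUM_LINES];
static unsigned long line_fetches;   /* number of simulated DMA transfers */

int cached_read(const int *main_mem, size_t index)
{
    assert(main_mem != NULL);

    size_t block = index / LINE_WORDS;
    struct cache_line *line = &lines[block % NUM_LINES];

    if (!line->valid || line->tag != block) {
        /* Miss: fetch the whole block. memcpy stands in for a DMA get. */
        memcpy(line->data, main_mem + block * LINE_WORDS, sizeof line->data);
        line->tag = block;
        line->valid = 1;
        line_fetches++;
    }
    return line->data[index % LINE_WORDS];
}
```

Reading 32 consecutive elements costs only two block fetches, and repeated reads of a cached element cost none. Two blocks that map to the same line evict each other, which is the price of the simple direct-mapped layout.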
The last question is “who” controls the cache mechanism. The PPE is naturally the main controller: it is in charge of all data distribution and of issuing the transfer commands. On the one hand, when the program is initialized the PPE reads and writes the data and hands tasks to the SPEs. For the array cache to work, a whole data block has to be sent to a single SPE, which reduces data transfer, similar to what is done in compilation. Cached function results may need to be available to every SPE, so it is probably better to store them in main memory. On the other hand, during execution the PPE receives and sends data, and handles storing and refreshing the cached function results. Besides the PPE, each SPE also plays a scheduling role, with its own mechanism for deciding whether something needs to be stored and whether to keep it in the local store (LS) or transfer it to main memory for caching. The detailed design is still under consideration.
Questions that remain to be considered: 1. What programs and algorithms have in common, especially in image processing; 2. How to describe these ideas mathematically; 3. Which case studies to use.