This week I implemented the waiting mechanism I described last week and ran it on a Cell BE machine.
I chose an 8*8 data set and divided it into 4 blocks. The single (sequential) execution time is about 0.005s and the parallel time is about 0.010s, so the parallel version is currently about twice as slow. It means I still have a lot of work to do.
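To keep track of how far the parallel version is from paying off, the two timings can be turned into a speedup figure. This is only a small sketch: the function names are my own, and the worker count passed to `efficiency` is an assumption (one worker per block), since the number of SPEs actually used is not recorded here.

```c
#include <assert.h>

/* Speedup = sequential time / parallel time.
 * A value below 1.0 means the parallel version is slower. */
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency = speedup / number of workers.
 * "workers" is an assumed value, e.g. one worker per block. */
double efficiency(double t_serial, double t_parallel, int workers) {
    assert(workers > 0);
    return speedup(t_serial, t_parallel) / workers;
}
```

With the times measured above, `speedup(0.005, 0.010)` is 0.5, and with an assumed 4 workers the efficiency is 0.125.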
Going through the code I wrote again, I found a lot of "for" loops and replacements, which may be delaying the execution.
To do: optimize the code and make it flexible.
1. Flexibility: collect the common values and set them with #define.
2. Cache results: each SPE should have its own cache list, used only by itself.
3. The PPU and SPU should choose the "next one" to continue executing: south or east, either set by default or chosen randomly. This will reduce the data transfer overhead.
4. Reduce the creation and destruction time: once we create an SPU, we should keep it running as long as possible. This will reduce the system overhead.
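Item 3 could look roughly like the sketch below. It is plain C with no Cell SDK calls, and everything in it is hypothetical: the names, and the assumption that the 4 blocks form a 2*2 grid. A worker that has finished a block tries its preferred direction first and falls back to the other one when that neighbor is outside the grid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumed layout: 8*8 data split into 4 blocks arranged as a 2*2 grid. */
#define BLOCK_ROWS 2
#define BLOCK_COLS 2

enum direction { DIR_SOUTH, DIR_EAST };

struct block_pos { int row; int col; };

static bool in_grid(struct block_pos p) {
    return p.row >= 0 && p.row < BLOCK_ROWS &&
           p.col >= 0 && p.col < BLOCK_COLS;
}

static struct block_pos step(struct block_pos p, enum direction d) {
    if (d == DIR_SOUTH) p.row += 1; else p.col += 1;
    return p;
}

/* Choose the next block after "cur": try the preferred direction,
 * then the other one. Returns false when no neighbor is left. */
bool choose_next(struct block_pos cur, enum direction preferred,
                 struct block_pos *next) {
    assert(next != NULL);
    enum direction other = (preferred == DIR_SOUTH) ? DIR_EAST : DIR_SOUTH;
    struct block_pos cand = step(cur, preferred);
    if (in_grid(cand)) { *next = cand; return true; }
    cand = step(cur, other);
    if (in_grid(cand)) { *next = cand; return true; }
    return false;
}

/* Random policy: pick south or east with roughly equal probability. */
enum direction random_direction(void) {
    return (rand() % 2 == 0) ? DIR_SOUTH : DIR_EAST;
}
```

A worker would call `choose_next(cur, DIR_SOUTH, &next)` for the default policy, or pass `random_direction()` for the random one. Because the next block is a neighbor, the data it needs from the previous block is already on the same worker, which is where the saving in transfers would come from.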
Others:
I am wondering how much data can be transferred between the SPU and the PPU, and in how many ways it can be done.
Our main points are: caching and choosing the neighbor.
Lots of work to do, so I will keep going… Do your best!
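For the caching half, one possible shape for a per-SPE cache is a small direct-mapped table keyed by block id. This is a minimal sketch in plain C, not the real implementation: the slot count, the result length and all the names are made up for illustration, and on a real SPE the table would have to fit in the local store.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_SLOTS 4   /* hypothetical number of slots per SPE */
#define RESULT_LEN  4   /* hypothetical length of one cached result */

struct cache_entry {
    bool valid;
    int  block_id;
    int  result[RESULT_LEN];
};

/* One instance per SPE; never shared with other workers. */
struct spe_cache {
    struct cache_entry slots[CACHE_SLOTS];
};

/* Mark every slot as empty. */
void cache_init(struct spe_cache *c) {
    memset(c, 0, sizeof *c);
}

/* Store a result; an older entry in the same slot is overwritten. */
void cache_put(struct spe_cache *c, int block_id, const int *result) {
    assert(block_id >= 0);
    struct cache_entry *e = &c->slots[block_id % CACHE_SLOTS];
    e->valid = true;
    e->block_id = block_id;
    memcpy(e->result, result, sizeof e->result);
}

/* Return the cached result for block_id, or NULL on a miss. */
const int *cache_get(const struct spe_cache *c, int block_id) {
    assert(block_id >= 0);
    const struct cache_entry *e = &c->slots[block_id % CACHE_SLOTS];
    return (e->valid && e->block_id == block_id) ? e->result : NULL;
}
```

On a miss the worker would fetch the data from main memory as it does now. On a hit it can skip that transfer, which is the whole point of combining the cache with the neighbor choice.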
Maybe I should start writing my essay now?