This week I implemented the waiting mechanism I described last week and ran it on a Cell BE machine.
I chose 8×8 data and divided it into 4 blocks. The single (serial) execution time is about 0.005 s and the parallel time is about 0.010 s, so the parallel version is currently about twice as slow as the serial one. That means I still have a lot of work to do.
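As a rough illustration of that setup, here is a minimal sketch in plain C (not SPE-specific code). The post does not say how the 4 blocks are laid out, so the 2×2 grid of 4×4 tiles is an assumption, and the names `extract_block` and `speedup` are hypothetical helpers made up for this example.

```c
#include <stddef.h>

#define N 8            /* data is N x N, as in the post */
#define TILE (N / 2)   /* ASSUMPTION: 4 blocks form a 2 x 2 grid of 4 x 4 tiles */

/* Hypothetical helper: copy block b (0..3, row-major over the 2 x 2 grid)
 * out of the flat row-major N x N array src into the flat TILE x TILE dst. */
void extract_block(const float *src, int b, float *dst)
{
    int row0 = (b / 2) * TILE;   /* first row of the block in src */
    int col0 = (b % 2) * TILE;   /* first column of the block in src */

    for (int r = 0; r < TILE; r++)
        for (int c = 0; c < TILE; c++)
            dst[r * TILE + c] = src[(row0 + r) * N + (col0 + c)];
}

/* Hypothetical helper: speedup = serial time / parallel time.
 * A value below 1.0 means the parallel run is slower. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}
```

With the times measured above, `speedup(0.005, 0.010)` gives 0.5, which is the "twice as slow" result. At this data size each block holds only 16 values, so fixed costs such as starting the workers and copying data can easily outweigh the computation itself.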
Going through my code again, I found a lot of “for” loops and replacements, which may be causing the delay in execution.