Detailed description |
|
In the assignToClusters_KMCUDA kernel, the coalesced prefetch works incorrectly because it assumes that all blockDim.x threads are active, whereas in the last block some threads may be inactive because of the (t<N) condition.
When N is not a multiple of blockDim.x, line campaign/trunk/dev/kmeansGPU/kmeansGPU.cu:159 is executed by only (N%blockDim.x) threads in the last block, so the cluster centroid coordinates are loaded into shared memory incorrectly if D>(N%blockDim.x).
The smoke test does not catch this because the test dataset has D=1, so even a single active thread in the block is enough to load the centroid coordinate.
I attach an example of a possible patch, although it has not been tested extensively. |
|
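For illustration only, here is a minimal sketch of one way to avoid the problem described above. It is not the attached patch and not the actual assignToClusters_KMCUDA code: the kernel name, parameter names, and the row-major layouts of X (N x D) and CTR (K x D) are all assumptions. The idea is to move the shared-memory load out of the (t<N) branch and use a block-stride loop, so every thread in the block helps load the centroid regardless of how many threads have a data point to process.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical sketch; names and data layouts are assumptions, not the real kernel.
// X: N x D points (row-major), CTR: K x D centroids (row-major), ASSIGN: N labels.
__global__ void assignToClustersSketch(int N, int K, int D,
                                       const float *X, const float *CTR, int *ASSIGN)
{
    extern __shared__ float ctr[];            // holds the D coordinates of one centroid
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float minDist = FLT_MAX;
    int best = 0;

    for (int k = 0; k < K; ++k) {
        // Load is NOT guarded by (t < N): all blockDim.x threads participate,
        // and the stride loop also covers the case D > blockDim.x.
        for (int d = threadIdx.x; d < D; d += blockDim.x)
            ctr[d] = CTR[k * D + d];
        __syncthreads();                      // reached by every thread in the block

        if (t < N) {                          // only the distance computation is guarded
            float dist = 0.0f;
            for (int d = 0; d < D; ++d) {
                float diff = X[t * D + d] - ctr[d];
                dist += diff * diff;
            }
            if (dist < minDist) { minDist = dist; best = k; }
        }
        __syncthreads();                      // do not overwrite ctr while it is still in use
    }
    if (t < N) ASSIGN[t] = best;
}

// Hypothetical host helper that runs the sketch kernel on small host arrays.
cudaError_t assignOnDevice(int N, int K, int D, int threads,
                           const float *hX, const float *hCTR, int *hASSIGN)
{
    float *dX = nullptr, *dCTR = nullptr;
    int *dASSIGN = nullptr;
    cudaMalloc(&dX, sizeof(float) * N * D);
    cudaMalloc(&dCTR, sizeof(float) * K * D);
    cudaMalloc(&dASSIGN, sizeof(int) * N);
    cudaMemcpy(dX, hX, sizeof(float) * N * D, cudaMemcpyHostToDevice);
    cudaMemcpy(dCTR, hCTR, sizeof(float) * K * D, cudaMemcpyHostToDevice);

    int blocks = (N + threads - 1) / threads; // last block may be only partly filled
    assignToClustersSketch<<<blocks, threads, sizeof(float) * D>>>(N, K, D, dX, dCTR, dASSIGN);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(hASSIGN, dASSIGN, sizeof(int) * N, cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dCTR); cudaFree(dASSIGN);
    return err;
}
```

A regression test for this bug would need D>(N%blockDim.x) with N not a multiple of the block size, for example N=5, D=3 and 4 threads per block, so that the last block has a single thread with t<N but three coordinates to load.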