I have a 3D stencil computation running on a Kepler GPU (compute capability 3.0). I am using CUDA blocks of size 32 x 4 x 4, which is 512 threads per block.
Something strange is happening, though. The very first memory reads in the kernel already return wrong values, but only when I increase the problem size to L=128 or higher (L is always a power of two, so the padding is correct). As far as I know, the maximum number of registers per thread on cc3.0 Kepler is 63. The ptxas output is:
ptxas info : Compiling entry function '_Z17kernel_metropolisiiPiS_PfffS_i' for 'sm_30'
ptxas info : Function properties for _Z17kernel_metropolisiiPiS_PfffS_i
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 48 registers, 8160 bytes smem, 372 bytes cmem[0], 8 bytes cmem[2]
It shows 48 registers, which is below the limit. However, if I add a 'return' statement a few lines earlier in the kernel, the compiler builds the kernel with 45 registers and the memory reads are correct again.
The problem does not occur with L=32 or L=64; in those cases the results are perfect. I am really not sure whether this is a register problem or something else. From what I know, a registers-per-thread problem should not appear or disappear when the problem size changes, since register usage depends only on the block configuration and, of course, the kernel code. Is that correct?
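To make that reasoning concrete, here is a rough host-side sketch of the resource check I have in mind. The limits are what I believe the compute-capability table lists for cc3.0 (63 registers per thread, 65536 registers per SM, 1024 threads per block, 48 KB of shared memory per block); please treat them as assumptions. The helper names are made up for illustration, and the check ignores register allocation granularity.

```cpp
// Assumed limits for compute capability 3.0 (not queried from the device).
constexpr int kMaxRegsPerThread   = 63;
constexpr int kRegsPerSM          = 65536;
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxSmemPerBlock    = 48 * 1024;

// Threads in one block of size bx x by x bz.
constexpr int threadsPerBlock(int bx, int by, int bz) {
    return bx * by * bz;
}

// Registers needed by one block, ignoring allocation granularity.
constexpr int regsPerBlock(int threads, int regsPerThread) {
    return threads * regsPerThread;
}

// True if a single block fits within the assumed limits.
// Note that the problem size L does not appear anywhere in this check.
constexpr bool blockFits(int threads, int regsPerThread, int smemBytes) {
    return threads <= kMaxThreadsPerBlock
        && regsPerThread <= kMaxRegsPerThread
        && regsPerBlock(threads, regsPerThread) <= kRegsPerSM
        && smemBytes <= kMaxSmemPerBlock;
}
```

With the numbers from the ptxas output, blockFits(threadsPerBlock(32, 4, 4), 48, 8160) evaluates to true (512 threads x 48 registers = 24576 registers per block), and the same holds for the 45-register build. If these limits are right, the launch configuration fits for every L, which is why I doubt that registers alone explain the difference.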
A pointer on where to start looking would be enough; I can work out the details on my own. Thanks in advance.