
I have a 3D stencil computation running on a Kepler cc3.0 GPU. I am using CUDA blocks of size 32 x 4 x 4, i.e. 512 threads per block.

Something strange is happening though. I get wrong values read right from the first lines of code in the kernel, but only if I increase the problem size to L = 128 or higher (always powers of two, for correct padding). The maximum number of registers per thread on cc3.0 Kepler is 63, I think. The ptxas output reports:

ptxas info    : Compiling entry function '_Z17kernel_metropolisiiPiS_PfffS_i' for 'sm_30'
ptxas info    : Function properties for _Z17kernel_metropolisiiPiS_PfffS_i
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 8160 bytes smem, 372 bytes cmem[0], 8 bytes cmem[2]

It shows 48 registers, which is fine. However, if I add a 'return' statement a few lines earlier in the kernel, the compiler brings the kernel down to 45 registers and the memory reads are correct again.

This problem does not occur if I choose L = 32 or L = 64; in those cases the results come out perfect. I am really not sure whether it is a register problem or something else, because as far as I know a registers-per-thread problem should not appear or disappear when the problem size changes: register usage depends on the block configuration and, of course, on the kernel code. Is that correct?

A pointer on where to start looking would be good enough for me; I can work out the details on my own. Thanks in advance.

labotsirc
    *"That is a typical symptom of register per thread problem"* What? IF you want an answer, then you are going to have to show us some code, otherwise I fail to see how anyone could possibly help you. – talonmies May 31 '14 at 07:02
  • The only time "registers per thread" variation should result in incorrect results is when a kernel cannot be launched due to too many registers per thread requested. And with [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) this condition is easy to spot. Stencil ops are hard to get correct, especially on the edge of data sets where the overall data set size is not an even multiple of the stencil dimensions. Agree with talonmies, and voting to close. – Robert Crovella May 31 '14 at 14:23
  • @talonmies sure i can show some code. Let me prepare something simpler – labotsirc May 31 '14 at 15:32
  • I have added the code and made a better explanation, i hope you see this question again, thanks. – labotsirc May 31 '14 at 19:03
  • @Robert i have checked errors but i get none after each kernel call. – labotsirc May 31 '14 at 20:45
  • The kernel by itself is not that useful. Provide a complete, compilable code. – Robert Crovella May 31 '14 at 23:03
  • @RobertCrovella I will provide a link to the *.tar so you can use the Makefile compile and reproduce the problem hopefully. – labotsirc May 31 '14 at 23:10
  • 1
    I don't want to see your whole code. Reduce it down to just the necessary pieces to reproduce the problem (Yes, that requires effort on your part), and post all of that in the question, not an external link. Those are expectations for a good question on SO. – Robert Crovella May 31 '14 at 23:16
  • @Robert Nevermind. I will continue by myself and will put the solution when i fix the problem, meanwhile the question is still open unless 2 more people close it. Thanks. – labotsirc May 31 '14 at 23:43

1 Answer


The problem with that kernel is not a registers-per-thread issue, but the following line:

int tid = z*L*L/2 + (blockIdx.y * BY/2 + threadIdx.y)*L + x;

which should have been:

int tid = z*L*L/4 + (blockIdx.y * BY/2 + threadIdx.y)*L + x;

That mistake caused an out-of-bounds access on the vector of random number generators, which in turn triggered a series of unfortunate events.

It was very tricky to find because cuda-memcheck did not detect the error.

labotsirc