
Strange mapping: example

In the following example, the first column is the cuda device chosen in the code, and the second column is the GPU that actually does the work instead:

0:0 1234 MiB
1:2 1234 MiB
2:7 1234 MiB

3:5 2341 MiB

4:1 3412 MiB
5:3 3412 MiB
6:4 3412 MiB
7:6 3412 MiB

Thus, to get GPUs 0,4,5,6,7, you code: 0,6,3,7,2.
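As a sketch, this lookup can also be put into code; the dict below just encodes the example mapping above, and the names gpu_to_cuda and wanted_gpus are made up for illustration:

# Example mapping from above: physical GPU (nvidia-smi) -> cuda index (code)
gpu_to_cuda = {0: 0, 1: 4, 2: 1, 3: 5, 4: 6, 5: 3, 6: 7, 7: 2}

wanted_gpus = [0, 4, 5, 6, 7]                         # GPUs as shown by nvidia-smi
cuda_indices = [gpu_to_cuda[g] for g in wanted_gpus]
print(cuda_indices)                                   # [0, 6, 3, 7, 2]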

How to check the mapping

I have this strange mapping all the time. You can test it like this:

  • build a tiny dummy model or
  • load a pretrained model.

Then move this model to each cuda device, one after the other, and after each step check the change in !nvidia-smi|tail (in a Jupyter cell; plain nvidia-smi in a shell) to see which GPU the cuda device got mapped to. This mapping does not change for the whole session, and it even stays the same after a server relaunch. Thus, it seems to be set by the technical hierarchy of the GPUs, which does not change unless you change the hardware.
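As a sketch, the whole check can also be scripted instead of done by hand (assuming nvidia-smi is on the PATH; the -12 tail length is only a rough guess at where the Processes table starts):

import subprocess
import torch

model = torch.Tensor(10, 10)                      # tiny dummy model, see below

for i in range(torch.cuda.device_count()):
    model = model.to(f'cuda:{i}')                 # occupy memory on cuda:i
    out = subprocess.run(['nvidia-smi'], capture_output=True, text=True).stdout
    print(f'--- after cuda:{i} ---')
    print('\n'.join(out.splitlines()[-12:]))      # roughly the Processes table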

Code to build some model

Dummy model (quick check, take this)

This tiny dummy tensor is taken from the PyTorch thread "How to delete PyTorch objects correctly from memory":

import torch
model = torch.Tensor(10, 10)  # tiny 10x10 tensor; moving it to a device is enough to make the process show up in nvidia-smi

This builds a tiny tensor "model". It is better than downloading a full pretrained model (as in the next heading) if you only want to check the cuda-to-GPU alignment as a test.

Full model (do not take this, take the dummy model instead)

from transformers import (AutoTokenizer, AutoModelForCausalLM, TextDataset,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def get_model(model_name):
    # Load pre-trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model

model_name = "dbmdz/german-gpt2"
tokenizer, model = get_model(model_name)

Code to check the devices

And this is the code that I ran for each of the devices, from 0 to 7. After each step, check which of the GPUs was filled with a new memory entry.

# Example for 0:
model = model.to('cuda:0')

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
+-----------------------------------------------------------------------------+

Thus, 0->0.

Example for 1:

model = model.to('cuda:1')

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    2   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
+-----------------------------------------------------------------------------+

Thus, 1->2.

...

Example for 2:

model = model.to('cuda:2')

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    2   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    7   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
+-----------------------------------------------------------------------------+

Thus, 2->7.

Further checks:

  • 3->5
  • 4->1
  • 5->3
  • 6->4
  • 7->6

Thus, to get GPUs 0,1,2,3,4,5,6,7 you need to take cuda devices 0,4,1,5,6,3,7,2.

So by the time you reach cuda:7, you know the whole mapping just from checking at each step what has changed:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    1   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    2   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    3   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    4   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    5   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    6   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
|    7   N/A  N/A   1073165      C   .../miniconda3/bin/python3.9     1000MiB |
+-----------------------------------------------------------------------------+

Further setup

The environment variable CUDA_VISIBLE_DEVICES is empty, and changing it does not change the mapping. The strange mapping has not changed since the beginning of the project, even though I have changed this variable many times.

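One possible explanation, stated here only as an assumption: CUDA reads CUDA_VISIBLE_DEVICES once, when the CUDA context is first initialized, so changing the variable later in a running session has no effect. A minimal sketch of setting it early enough:

import os
# Must be set before the first CUDA call, ideally before importing torch:
os.environ['CUDA_VISIBLE_DEVICES'] = '2,0'   # example value; cuda:0 -> device 2, cuda:1 -> device 0 (in CUDA's own enumeration order)

import torch
print(torch.cuda.device_count())             # only the two listed devices are visible now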

Question

What can be done to get rid of the strange mapping between the coded GPU (first column in the first example above) and the chosen GPU (second column), that is, the mismatch you see when you check the code (model.to('cuda:MY_GPU_NUMBER')) against the output of nvidia-smi?

questionto42

1 Answer


You can try setting the environment variable CUDA_DEVICE_ORDER to the value PCI_BUS_ID to get the ordering to be the same as the physical arrangement of the PCI lanes.
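A minimal sketch of how this could look, assuming the variable must be set before CUDA is initialized (e.g. at the very top of the script or notebook):

import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'   # enumerate GPUs in PCI bus order, like nvidia-smi

import torch
model = torch.Tensor(10, 10)
model = model.to('cuda:3')                       # should now show up on GPU 3 in nvidia-smi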

noe