I am asking this because I could not fix it with the help of these questions:
- Stack Overflow: RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2
- Stack Overflow: Pytorch : Expected all tensors on same device
Jupyter Notebook server
I am on a Jupyter Notebook server; therefore, bash code starts with "!".
You have to begin with the following line in the same Jupyter Notebook cell in which you build the model. Note that it does not seem to work if you change the environment variable with import os and then os.environ['CUDA_VISIBLE_DEVICES'] = '6,3,7,2'; others ran into the same problem, see setting CUDA_VISIBLE_DEVICES just has no effect #9158 (the variant that did not work for me is sketched after the export line below).
!export CUDA_VISIBLE_DEVICES='6,3,7,2'
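For reference, the in-Python variant that did not take effect for me looked roughly like this (and, as far as I know, it could only work at all if it runs before anything initializes CUDA):

import os
# Must be set before the first CUDA call initializes the driver context
os.environ['CUDA_VISIBLE_DEVICES'] = '6,3,7,2'

import torch
print(torch.cuda.device_count())  # 4 if the restriction took effect, 8 otherwise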
After wrapping the model in DataParallel, memory should then be spread among GPUs 4, 5, 6 and 7. If you wonder why the code lists 6,3,7,2 even though it then works on 4,5,6,7, see "model.to('cuda:6')" becomes (nvidia-smi) GPU 4, same with any other "cuda:MY_GPU", only "cuda:0" becomes GPU 0. How do I get rid of this mapping?
*The outcome also turned out to be without GPU 7 (just 4, 5, 6), perhaps because it was not needed; that is not the question, since my main aim is to avoid GPUs 1, 2 and 3, which are needed for another project. I would also like to spare GPU 0 so that I end up with only 4, 5, 6, 7, but that is not urgent. In short: if you do not face the same mapping problem (6,3,7,2 -> 4,5,6,7), go on with your working setup; if you do, check the linked question. The right mapping is not the question here.
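To see what the process actually exposes after restricting the devices, one can list the visible devices from PyTorch and compare them with nvidia-smi; a small sketch (the indices here are relative to CUDA_VISIBLE_DEVICES, not the physical numbering):

import torch

# Index i is "cuda:i" inside this process; compare name/memory against nvidia-smi
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}", props.name, f"{props.total_memory / 1024**3:.1f} GiB")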
Code
Main model (run on some chosen GPUs)
Here is how I build the model.
!export CUDA_VISIBLE_DEVICES='6,3,7,2'
from transformers import (AutoTokenizer, AutoModelForCausalLM, AutoConfig, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def get_model(model_name):
    #### from transformers import GPT2LMHeadModel, GPT2Tokenizer
    # Load pre-trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model

import torch
device = torch.device('cuda:6')
print(torch.cuda.device_count())
torch.cuda.set_device(device)

model_name = "dbmdz/german-gpt2"
tokenizer, model = get_model(model_name)
config = model.config
print(next(model.parameters()).device)

device_ids = [6, 3, 7, 2]
model = torch.nn.DataParallel(model, device_ids=device_ids)
print(model.device_ids)
print(next(model.parameters()).device)
Out:
8
cpu
[6, 3, 7, 2]
cuda:0
cpu
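As far as I understand the PyTorch documentation, torch.nn.DataParallel requires the wrapped module to already sit on device_ids[0] before the forward pass, which is exactly what the first RuntimeError quoted in the question below says. A minimal sketch of that pattern with my indices (model is the Hugging Face model loaded above):

import torch

device_ids = [6, 3, 7, 2]
# DataParallel expects the parameters and buffers on device_ids[0] before running forward
model = model.to(f'cuda:{device_ids[0]}')
model = torch.nn.DataParallel(model, device_ids=device_ids)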
Fine-tuning the model with the Transformers Trainer class
def make_finetuned_model(tokenizer, model, file_path='myfile.txt', model_name="fine-tuned-model", bln_truncation=True,
                         num_train_epochs=1, per_device_train_batch_size=1, save_steps=10_000):
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=512,
        overwrite_cache=True,
    )
    print(next(model.parameters()).device)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    print(next(model.parameters()).device)
    model_folder = f"./{model_name}"

    # Define the Trainer
    trainer = Trainer(
        model=model.to('cuda:6'),  # "model=model" should run through as well
        args=TrainingArguments(
            output_dir=model_folder,
            overwrite_output_dir=True,
            num_train_epochs=num_train_epochs,
            per_device_train_batch_size=per_device_train_batch_size,
            save_steps=save_steps,
        ),
        data_collator=data_collator,
        train_dataset=train_dataset,
    )
    model.to('cuda:6')
    print(next(model.parameters()).device)

    # Fine-tune the model
    trainer.train()

    # Save the model and tokenizer to the fine-tuned model directory.
    # This is needed since the model config and tokenizer have to be loaded again whenever the model is loaded.
    # Since the fine-tuned model is wrapped with DataParallel, save the underlying model with:
    model.module.save_pretrained(model_folder)
    tokenizer.save_pretrained(model_folder)

make_finetuned_model(tokenizer, model, file_path='myfile.txt',
                     model_name="fine_tuned_model", bln_truncation=True,
                     num_train_epochs=1, per_device_train_batch_size=1, save_steps=10_000)
Out:
cuda:6
cuda:6
cuda:0
cuda:0
Thus, building the Trainer object resets the device to "cuda:0" no matter what was set before; see the third printout, cuda:0, after it had been cuda:6 before. I checked the Hugging Face thread Setting specific device for Trainer, which has been open and active since August 2022 (!).
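The device the Trainer will move the model to can already be seen from the TrainingArguments alone; a minimal sketch (the output_dir is just a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(output_dir='./tmp-device-check')
print(args.device)  # should resolve to cuda:0 here, matching the third printout above
print(args.n_gpu)   # number of GPUs the Trainer sees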
Since I chose four other GPUs and GPU 0 comes on top of them, the error is thrown.
Question
The Transformers Trainer class always sets the device to "cuda:0". How do I get rid of the errors:
RuntimeError: module must have its parameters and buffers on device cuda:6 (device_ids[0]) but found one of them on device: cuda:0
and, while working on the same code but more rarely (I do not know how to reproduce this error, and the code is lost):
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:4! (when checking argument for argument index in method wrapper_CUDA__index_select)
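For debugging, the devices of all parameters can be counted quickly, which shows whether a stray tensor is still on cuda:0 (a minimal sketch; model is the DataParallel-wrapped model from above):

from collections import Counter

# Count how many parameters sit on each device
print(Counter(str(p.device) for p in model.parameters()))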