I am asking this because I could not fix it with the help of these questions:
- Stack Overflow: RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2
- Stack Overflow: Pytorch : Expected all tensors on same device
Jupyter Notebook server
I am on a Jupyter Notebook server; therefore, bash code starts with "!".
You have to begin with the following line in the same Jupyter Notebook cell in which you build the model. Note that it does not seem to work if you change the environment variable with import os and then os.environ['CUDA_VISIBLE_DEVICES'] = '6,3,7,2'; others ran into the same problem, see setting CUDA_VISIBLE_DEVICES just has no effect #9158 (the variant that did not work for me is sketched after the export line below).
!export CUDA_VISIBLE_DEVICES='6,3,7,2'
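For reference, the in-Python variant that did not take effect for me looked roughly like this (and, as far as I know, it could only work at all if it runs before anything initializes CUDA):

import os
# Must be set before the first CUDA call initializes the driver context
os.environ['CUDA_VISIBLE_DEVICES'] = '6,3,7,2'

import torch
print(torch.cuda.device_count())  # 4 if the restriction took effect, 8 otherwise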
After wrapping the model in DataParallel, memory should then be spread among GPUs 4, 5, 6 and 7. If you wonder why the code lists 6,3,7,2 even though it then works on 4,5,6,7, see "model.to('cuda:6')" becomes (nvidia-smi) GPU 4, same with any other "cuda:MY_GPU", only "cuda:0" becomes GPU 0. How do I get rid of this mapping?
*The outcome also turned out to be without GPU 7 (just 4, 5, 6), perhaps because it was not needed; that is not the question, since my main aim is to avoid GPUs 1, 2 and 3, which are needed for another project. I would also like to spare GPU 0 so that I end up with only 4, 5, 6, 7, but that is not urgent. In short: if you do not face the same mapping problem (6,3,7,2 -> 4,5,6,7), go on with your working setup; if you do, check the linked question. The right mapping is not the question here.
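To see what the process actually exposes after restricting the devices, one can list the visible devices from PyTorch and compare them with nvidia-smi; a small sketch (the indices here are relative to CUDA_VISIBLE_DEVICES, not the physical numbering):

import torch

# Index i is "cuda:i" inside this process; compare name/memory against nvidia-smi
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}", props.name, f"{props.total_memory / 1024**3:.1f} GiB")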
Code
Main model (run on some chosen GPUs)
Here is how I build the model.
!export CUDA_VISIBLE_DEVICES='6,3,7,2'
from transformers import (AutoTokenizer, AutoModelForCausalLM, AutoConfig, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def get_model(model_name):
    #### from transformers import GPT2LMHeadModel, GPT2Tokenizer
    # Load pre-trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model

import torch
device = torch.device('cuda:6')
print(torch.cuda.device_count())
torch.cuda.set_device(device)

model_name = "dbmdz/german-gpt2"
tokenizer, model = get_model(model_name)
config = model.config
print(next(model.parameters()).device)

device_ids = [6, 3, 7, 2]
model = torch.nn.DataParallel(model, device_ids=device_ids)
print(model.device_ids)
print(next(model.parameters()).device)
Out:
8
cpu
[6, 3, 7, 2]
cuda:0
cpu
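As far as I understand the PyTorch documentation, torch.nn.DataParallel requires the wrapped module to already sit on device_ids[0] before the forward pass, which is exactly what the first RuntimeError quoted in the question below says. A minimal sketch of that pattern with my indices (model is the Hugging Face model loaded above):

import torch

device_ids = [6, 3, 7, 2]
# DataParallel expects the parameters and buffers on device_ids[0] before running forward
model = model.to(f'cuda:{device_ids[0]}')
model = torch.nn.DataParallel(model, device_ids=device_ids)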
Fine-tuning the model with the Transformers Trainer class
def make_finetuned_model(tokenizer, model, file_path='myfile.txt', model_name="fine-tuned-model", bln_truncation=True,
                         num_train_epochs=1, per_device_train_batch_size=1, save_steps=10_000):
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=512,
        overwrite_cache=True,
    )
    print(next(model.parameters()).device)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    print(next(model.parameters()).device)
    model_folder = f"./{model_name}"

    # Define the Trainer
    trainer = Trainer(
        model=model.to('cuda:6'),  # "model=model" should run through as well
        args=TrainingArguments(
            output_dir=model_folder,
            overwrite_output_dir=True,
            num_train_epochs=num_train_epochs,
            per_device_train_batch_size=per_device_train_batch_size,
            save_steps=save_steps,
        ),
        data_collator=data_collator,
        train_dataset=train_dataset,
    )
    model.to('cuda:6')
    print(next(model.parameters()).device)

    # Fine-tune the model
    trainer.train()

    # Save the model and tokenizer to the fine-tuned model directory.
    # This is needed since the model config and tokenizer have to be loaded again whenever the model is loaded.
    # Since the fine-tuned model is wrapped with DataParallel, save the underlying model with:
    model.module.save_pretrained(model_folder)
    tokenizer.save_pretrained(model_folder)

make_finetuned_model(tokenizer, model, file_path='myfile.txt',
                     model_name="fine_tuned_model", bln_truncation=True,
                     num_train_epochs=1, per_device_train_batch_size=1, save_steps=10_000)
Out:
cuda:6
cuda:6
cuda:0
cuda:0
Thus, building the Trainer object resets the device to "cuda:0" no matter what was set before; see the third printout, cuda:0, after it had been cuda:6 before. I checked the Hugging Face thread Setting specific device for Trainer, which has been open and active since August 2022 (!).
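The device the Trainer will move the model to can already be seen from the TrainingArguments alone; a minimal sketch (the output_dir is just a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(output_dir='./tmp-device-check')
print(args.device)  # should resolve to cuda:0 here, matching the third printout above
print(args.n_gpu)   # number of GPUs the Trainer sees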
Since I chose four other GPUs and GPU 0 comes on top of them, the error is thrown.
Question
The Transformers Trainer class always sets the device to "cuda:0". How do I get rid of the errors:
RuntimeError: module must have its parameters and buffers on device cuda:6 (device_ids[0]) but found one of them on device: cuda:0
and, while working on the same code but more rarely (I do not know how to reproduce this error, and the code is lost):
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:4! (when checking argument for argument index in method wrapper_CUDA__index_select)
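For debugging, the devices of all parameters can be counted quickly, which shows whether a stray tensor is still on cuda:0 (a minimal sketch; model is the DataParallel-wrapped model from above):

from collections import Counter

# Count how many parameters sit on each device
print(Counter(str(p.device) for p in model.parameters()))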