Aim
I want to fine-tune a text generation model on essays of varying length and then ask a few questions about each of these input texts. I already have a range of question-answer pairs at hand for each essay, which should be enough for a first prototype. Yet my aim is not to get back the answers that I fed in during training. This is not about building a chat for a bank client who needs exactly such a clear, learnt answer. It is more about a free-form opinion, judgement or understanding that I myself might have missed while reading the text.
Example: do I need 50 models to get answers for each essay?
Thus, if I have 50 essays and each of them comes with 5 questions and 5 answers for training, I cannot simply train on all question-answer pairs. If I ask a question about one essay, I do not want to get back the answer the model already knows; I want a new answer, as if the model had never seen that answer during training. I can train the model on all essays, but not on all question-answer pairs: I have to train on every question-answer pair except those of the one essay I want to ask about, for which I feed in only the questions, not the answers. Only then do I get open answers to those questions, while the model still generalizes from the answers of the 49 other essays. If the answers were known during training, the model would just give me back the answers it memorized.
My aim is to get new answers from a model that does not know the answers, but tries to find them from the given essay and from the generalization over the 49 other essays and their question-answer pairs.
Essay 1:
- Question 1 of 5: What is the plot of the main character?
- Answer: Santa Claus is stressed and tries to skip Christmas. By chance, he brings the world the most comfortable Christmas ever.
Essay 2:
- Question 1 of 5: What is the plot of the main character?
- Answer: Frank Franklin is a wildlife activist who gets almost shot by a jungle company but survives and fights back.
Essay 3:
- Question 1 of 5: What is the historical background of the story?
- Answer: The story is set in the 19th century during early industrialization, when a boom made some people rich in a short time through the first stock markets and speculation.
Essay 4:
...
Essay 50:
...
Now, when I train on essays 1 to 50 and want free-form answers about essay 2, I would train essay 1 and essays 3-50 with their question-answer pairs, while essay 2 gets only its 5 questions (without the 5 answers, so that the answers stay open when I ask later); a code sketch of this setup follows the example:
Essay 1:
- Question 1 of 5: What is the plot of the main character?
- Answer: Santa Claus is stressed and tries to skip Christmas. By chance, he brings the world the most comfortable Christmas ever.
Essay 2:
- Question 1 of 5: What is the plot of the main character?
- NO ANSWER HERE, so that the model has to generate its own answer when I ask after fine-tuning
Essay 3:
- Question 1 of 5: What is the historical background of the story?
- Answer: The story is set in the 19th century during early industrialization, when a boom made some people rich in a short time through the first stock markets and speculation.
Essay 4:
...
Essay 50:
...
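A minimal sketch of how I build such a held-out training file; the data layout and the helper function are my own invention, just for illustration, and the result is written to the myfile.txt that the fine-tuning code below reads:

# Sketch: build one training file that withholds the answers of a chosen essay.
essays = [
    {
        "essay": "Full text of essay 1 ...",
        "qa_pairs": [
            ("What is the plot of the main character?",
             "Santa Claus is stressed and tries to skip Christmas. ..."),
            # ... 4 more question-answer pairs
        ],
    },
    # ... essays 2 to 50
]

def build_training_text(essays, held_out_index):
    """Keep question-answer pairs for every essay except the held-out one,
    which keeps only its questions."""
    parts = []
    for i, item in enumerate(essays):
        parts.append(item["essay"])
        for question, answer in item["qa_pairs"]:
            parts.append(f"Question: {question}")
            if i != held_out_index:
                parts.append(f"Answer: {answer}")
    return "\n".join(parts)

# Hold out the answers of essay 2 (index 1) and write the training file.
with open("./myfile.txt", "w", encoding="utf-8") as f:
    f.write(build_training_text(essays, held_out_index=1))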
But I wonder whether there is a way to train just one model that can answer every essay's questions in free form, as if it had never seen that essay's own answers but only the answers of all the other essays' questions.
If I retrained the whole model on the remaining essays with all their question-answer pairs, while withholding the answers of the one essay that I want to ask about, I would have to run the fine-tuning again every time I change the essay, which is quite a waste of energy and machine time.
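The naive version of this would be a loop like the following, reusing the build_training_text() helper from the sketch above; the fine-tuning itself is only hinted at here, it is the Trainer code further down:

# One full fine-tuning run per held-out essay: 50 essays mean 50 trained models.
for held_out_index in range(len(essays)):
    with open("./myfile.txt", "w", encoding="utf-8") as f:
        f.write(build_training_text(essays, held_out_index))
    # ... then run the Trainer code below and save each model to its own
    # folder, e.g. f"./model_essay_{held_out_index}"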
Tweaking the text generation model
I tried a text generation model (german-gpt2) and fed it just one chosen essay and its 5 questions, without the answers. This very small fine-tuned model gave bad answers. One essay is clearly not enough for a generalizing text generation model.
I trained the same text generation model again, this time without the questions, and when I then asked it to write text after a prompt, the new text was not good enough: mostly too abstract, too far away from the essay, and a bit weird.
Should I pass eos_token as an argument to tokenizer.encode_plus() and also add end-of-sequence [EOS] tokens to the input text itself? Would that make the model any better? Does the model give better answers when there are padding [PAD] tokens via the pad_token argument of tokenizer.encode_plus()? Which other tweaks and tricks would give better answers?
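What I currently have in mind for EOS and padding looks like this sketch: reuse the EOS token as the padding token (a common pattern for GPT-2-style tokenizers, as I understand it) and append the EOS token to the text before encoding. The example text and max_length are arbitrary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
# GPT-2 tokenizers ship without a padding token; reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

text = "Question: What is the plot of the main character?"
# Append the EOS token to the text itself, then pad/truncate to a fixed length.
encoded = tokenizer.encode_plus(
    text + tokenizer.eos_token,
    padding="max_length",
    truncation=True,
    max_length=512,
)
print(encoded["input_ids"][:20])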
What would help the most to get better text generation? Up to now, the text output of the text generation model is not good.
Fine-tuning code with just one file as the text input
I train the text generation model with the code that you find at How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The GPT-2 tokenizer has no padding token; reuse the EOS token so that
# padding="max_length" below does not raise an error.
tokenizer.pad_token = tokenizer.eos_token

file_path = './myfile.txt'
bln_truncation = False
num_train_epochs = 1
per_device_train_batch_size = 1
save_steps = 10_000
block_size = 512

# One training example per line of the text file.
dataset = load_dataset("text", data_files={"train": file_path})

def tokenize_function(examples):
    # block_size is used as the padding/truncation length.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=bln_truncation,
        max_length=block_size,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# mlm=False: causal language modeling, the labels are the (shifted) input ids.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
And here begins the fine-tuning with the Transformers PyTorch Trainer class, which seems to be the first choice on Hugging Face, see Train with PyTorch Trainer.
model_folder = f"./{model_name}"
training_args = TrainingArguments(
output_dir=model_folder,
overwrite_output_dir=True,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
save_steps=save_steps,
)
model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=tokenized_datasets["train"],
)
trainer.train()
# The model is not wrapped in DataParallel here, so save it directly
# (model.module only exists on wrapped models).
model.save_pretrained(model_folder)
tokenizer.save_pretrained(model_folder)
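For completeness, this is roughly how I generate text from the saved model afterwards; the prompt format and the sampling parameters are just my current guesses:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_folder = "./dbmdz/german-gpt2"  # the folder used by save_pretrained() above
model = AutoModelForCausalLM.from_pretrained(model_folder)
tokenizer = AutoTokenizer.from_pretrained(model_folder)

prompt = "Question: What is the plot of the main character?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling parameters are guesses; I have not found settings that give good output yet.
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))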
Question
How do I get a good answer, ideally in free form, to a question about an essay, using the generalized knowledge of all the other essays and their question-answer pairs, but without the knowledge of the answers of the chosen essay itself?
How should I set up the training or the model if I want answers for each essay, always trained on all essays and all question-answer pairs except the answers of the chosen essay that I want to ask about, and with the answers staying within the boundaries of that chosen essay? Do I have to train 50 models if I have 50 essays?
Other models
There might be better models to reach this aim; I read about Retrieval-Augmented Generation (RAG) models at How does fine-tuning work in question answering for custom documents. But that question has already been asked, so I do not want to create a duplicate here.
I also tried a Question Answering model, but it answers with text cut out of the essay, thus not in free form, at least when I train it with just one essay. It might generalize better with more input. But such a question has already been asked at Fine-tuning a pre-trained LLM for question-answering, and I do not want to create a duplicate here.
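For reference, the extractive behaviour looks roughly like this; the model name is only an example of a German extractive QA model, not the one I am bound to:

from transformers import pipeline

# An extractive QA pipeline returns a span copied out of the context,
# not a freely formulated answer.
qa = pipeline("question-answering", model="deepset/gelectra-base-germanquad")
result = qa(
    question="What is the plot of the main character?",
    context="Full text of the chosen essay ...",
)
print(result["answer"])  # a literal substring of the context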