Aim
I want to fine-tune a text generation model on essays of varying length and then ask a few questions about each of these input texts. I already have a range of question-answer pairs at hand for each essay, which should be enough for a first prototype. Yet my aim is not to get back the answers that I fed in during training. This is not about building a chat for a bank client who needs exactly such a clear, learnt answer. It is more about a free-form opinion, judgement or understanding that I myself might have missed while reading the text.
Example: do I need 50 models to get answers for each essay?
Thus, if I have 50 essays and each of them comes with 5 questions and 5 answers for training, I cannot simply train on all question-answer pairs. If I ask a question about one essay, I do not want to get back the answer the model already knows; I want a new answer, as if the model had never seen that answer during training. I can train the model on all essays, but not on all question-answer pairs: I have to train on every question-answer pair except those of the one essay I want to ask about, for which I feed in only the questions, not the answers. Only then do I get open answers to those questions, while the model still generalizes from the answers of the 49 other essays. If the answers were known during training, the model would just give me back the answers it memorized.
My aim is to get new answers from a model that does not know the answers, but tries to find them from the given essay and from the generalization over the 49 other essays and their question-answer pairs.
Essay 1:
- Question 1 of 5: What is the plot of the main character?
- Answer: Santa Claus is stressed and tries to skip Christmas. By chance, he brings the world the most comfortable Christmas ever.
Essay 2:
- Question 1 of 5: What is the plot of the main character?
- Answer: Frank Franklin is a wildlife activist who gets almost shot by a jungle company but survives and fights back.
Essay 3:
- Question 1 of 5: What is the historical background of the story?
- Answer: The story is set in the 19th century during early industrialization, when a boom made some people rich in a short time through the first stock markets and speculation.
Essay 4:
...
Essay 50:
...
Now, when I train on essays 1 to 50 and want free-form answers about essay 2, I would train essay 1 and essays 3-50 with their question-answer pairs, while essay 2 gets only its 5 questions (without the 5 answers, so that the answers stay open when I ask later); a code sketch of this setup follows the example:
Essay 1:
- Question 1 of 5: What is the plot of the main character?
- Answer: Santa Claus is stressed and tries to skip Christmas. By chance, he brings the world the most comfortable Christmas ever.
Essay 2:
- Question 1 of 5: What is the plot of the main character?
- NO ANSWER HERE, so that the model has to generate its own answer when I ask after fine-tuning
Essay 3:
- Question 1 of 5: What is the historical background of the story?
- Answer: The story is set in the 19th century during early industrialization, when a boom made some people rich in a short time through the first stock markets and speculation.
Essay 4:
...
Essay 50:
...
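A minimal sketch of how I build such a held-out training file; the data layout and the helper function are my own invention, just for illustration, and the result is written to the myfile.txt that the fine-tuning code below reads:

# Sketch: build one training file that withholds the answers of a chosen essay.
essays = [
    {
        "essay": "Full text of essay 1 ...",
        "qa_pairs": [
            ("What is the plot of the main character?",
             "Santa Claus is stressed and tries to skip Christmas. ..."),
            # ... 4 more question-answer pairs
        ],
    },
    # ... essays 2 to 50
]

def build_training_text(essays, held_out_index):
    """Keep question-answer pairs for every essay except the held-out one,
    which keeps only its questions."""
    parts = []
    for i, item in enumerate(essays):
        parts.append(item["essay"])
        for question, answer in item["qa_pairs"]:
            parts.append(f"Question: {question}")
            if i != held_out_index:
                parts.append(f"Answer: {answer}")
    return "\n".join(parts)

# Hold out the answers of essay 2 (index 1) and write the training file.
with open("./myfile.txt", "w", encoding="utf-8") as f:
    f.write(build_training_text(essays, held_out_index=1))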
But I wonder whether there is a way to train just one model that can answer every essay's questions in free form, as if it had never seen that essay's own answers but only the answers of all the other essays' questions.
If I retrained the whole model on the remaining essays with all their question-answer pairs, while withholding the answers of the one essay that I want to ask about, I would have to run the fine-tuning again every time I change the essay, which is quite a waste of energy and machine time.
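The naive version of this would be a loop like the following, reusing the build_training_text() helper from the sketch above; the fine-tuning itself is only hinted at here, it is the Trainer code further down:

# One full fine-tuning run per held-out essay: 50 essays mean 50 trained models.
for held_out_index in range(len(essays)):
    with open("./myfile.txt", "w", encoding="utf-8") as f:
        f.write(build_training_text(essays, held_out_index))
    # ... then run the Trainer code below and save each model to its own
    # folder, e.g. f"./model_essay_{held_out_index}"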
Tweaking the text generation model
I tried a text generation model (german-gpt2) and fed it just one chosen essay and its 5 questions, without the answers. This very small fine-tuned model gave bad answers. One essay is clearly not enough for a generalizing text generation model.
I trained the same text generation model again, this time without the questions, and when I then asked it to write text after a prompt, the new text was not good enough: mostly too abstract, too far away from the essay, and a bit weird.
Should I pass eos_token as an argument to tokenizer.encode_plus() and also add end-of-sequence [EOS] tokens to the input text itself? Would that make the model any better? Does the model give better answers when there are padding [PAD] tokens via the pad_token argument of tokenizer.encode_plus()? Which other tweaks and tricks would give better answers?
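What I currently have in mind for EOS and padding looks like this sketch: reuse the EOS token as the padding token (a common pattern for GPT-2-style tokenizers, as I understand it) and append the EOS token to the text before encoding. The example text and max_length are arbitrary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
# GPT-2 tokenizers ship without a padding token; reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

text = "Question: What is the plot of the main character?"
# Append the EOS token to the text itself, then pad/truncate to a fixed length.
encoded = tokenizer.encode_plus(
    text + tokenizer.eos_token,
    padding="max_length",
    truncation=True,
    max_length=512,
)
print(encoded["input_ids"][:20])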
What would help the most to get better text generation? Up to now, the text output of the text generation model is not good.
Fine-tuning code with just one file as the text input
I train the text generation model with the code that you find at How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The GPT-2 tokenizer has no padding token; reuse the EOS token so that
# padding="max_length" below does not raise an error.
tokenizer.pad_token = tokenizer.eos_token

file_path = './myfile.txt'
bln_truncation = False
num_train_epochs = 1
per_device_train_batch_size = 1
save_steps = 10_000
block_size = 512

# One training example per line of the text file.
dataset = load_dataset("text", data_files={"train": file_path})

def tokenize_function(examples):
    # block_size is used as the padding/truncation length.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=bln_truncation,
        max_length=block_size,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# mlm=False: causal language modeling, the labels are the (shifted) input ids.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
And here begins the fine-tuning with the Transformers PyTorch Trainer class, which seems to be the first choice on Hugging Face, see Train with PyTorch Trainer.
model_folder = f"./{model_name}"
training_args = TrainingArguments(
output_dir=model_folder,
overwrite_output_dir=True,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
save_steps=save_steps,
)
model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=tokenized_datasets["train"],
)
trainer.train()
# The model is not wrapped in DataParallel here, so save it directly
# (model.module only exists on wrapped models).
model.save_pretrained(model_folder)
tokenizer.save_pretrained(model_folder)
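For completeness, this is roughly how I generate text from the saved model afterwards; the prompt format and the sampling parameters are just my current guesses:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_folder = "./dbmdz/german-gpt2"  # the folder used by save_pretrained() above
model = AutoModelForCausalLM.from_pretrained(model_folder)
tokenizer = AutoTokenizer.from_pretrained(model_folder)

prompt = "Question: What is the plot of the main character?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling parameters are guesses; I have not found settings that give good output yet.
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))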
Question
How do I get a good answer, ideally in free form, to a question about an essay, using the generalized knowledge of all the other essays and their question-answer pairs, but without the knowledge of the answers of the chosen essay itself?
How should I set up the training or the model if I want answers for each essay, always trained on all essays and all question-answer pairs except the answers of the chosen essay that I want to ask about, and with the answers staying within the boundaries of that chosen essay? Do I have to train 50 models if I have 50 essays?
Other models
There might be better models to reach this aim; I read about Retrieval-Augmented Generation (RAG) models at How does fine-tuning work in question answering for custom documents. But that question has already been asked, so I do not want to create a duplicate here.
I also tried a Question Answering model, but it answers with text cut out of the essay, thus not in free form, at least when I train it with just one essay. It might generalize better with more input. But such a question has already been asked at Fine-tuning a pre-trained LLM for question-answering, and I do not want to create a duplicate here.
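For reference, the extractive behaviour looks roughly like this; the model name is only an example of a German extractive QA model, not the one I am bound to:

from transformers import pipeline

# An extractive QA pipeline returns a span copied out of the context,
# not a freely formulated answer.
qa = pipeline("question-answering", model="deepset/gelectra-base-germanquad")
result = qa(
    question="What is the plot of the main character?",
    context="Full text of the chosen essay ...",
)
print(result["answer"])  # a literal substring of the context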