Do we really need a very large dataset to train GPTs? If the dataset is not big, will GPT still work well? Or will it still work better than conventional learning models in that situation? And is it possible to quantitatively determine the minimum number of samples a dataset needs for this task? For example, if we are talking about malware samples, can we say that a dataset suitable for GPTs should contain no fewer than a certain number of samples?
1 Answer
Train from scratch
If you want to train an LLM from scratch, then it's a different ball game. In that case, yes, you need a lot of data for training. This is because when you fine-tune a model (any model), that model has already learnt a great deal from its base training data (the data it was trained on from scratch). The model therefore already has a significant amount of vocabulary and information stored in its weights, so you only need a small amount of custom data (the data you want the model to specialise on).
But when training from scratch, the model has no prior vocabulary or information; the weights contain nothing. So yes, in this case you would need a lot of data to make the model capable.
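To make the distinction concrete, here is a minimal sketch of the two starting points. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, which are my choices for illustration rather than anything from the question.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Training from scratch: the architecture is instantiated with randomly
# initialised weights, so it encodes nothing until it sees a large corpus.
scratch_model = GPT2LMHeadModel(GPT2Config())

# Fine-tuning: the same architecture, but the weights already encode what the
# base model learnt during pretraining, so a small custom dataset is enough
# to specialise it.
pretrained_model = GPT2LMHeadModel.from_pretrained("gpt2")
```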
How much data you need is not fixed for any model. To repeat the rule of thumb: train the model on as much data as you have in order to get a solid, robust base model. More data is always good!
Fine-tune LLMs
GPTs are Large Language Models (LLMs). They are similar to ordinary base language models; the main difference is that the number of parameters in an LLM is huge compared to a normal/base language model.
This makes the models more accurate and more robust than normal language models. Another advantage of LLMs is that they require relatively little training data compared to normal language models, so yes, GPTs need less training data than conventional models.
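For a concrete feel for how small a fine-tuning set can be, here is a minimal sketch of fine-tuning a pretrained GPT-2 on a tiny custom text corpus. It assumes the Hugging Face transformers and datasets libraries; the two-sample "corpus", the model choice, and the hyperparameters are purely illustrative, not a recommendation for the malware task in the question.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A deliberately tiny custom corpus: because the base model is pretrained,
# even a small domain-specific dataset can shift its behaviour.
texts = [
    "analysis report: the binary injects code into explorer.exe",
    "analysis report: the script exfiltrates browser credentials",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```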
Some LLMs require only a few examples to predict accurately (look up few-shot learning), and some large models can predict with zero task-specific samples (zero-shot learning). So, depending on the model you are using, you may need anywhere from zero to a few examples.
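As a rough illustration of zero-shot and few-shot use, here is a minimal sketch, again assuming the transformers library; the facebook/bart-large-mnli and gpt2 checkpoints, the candidate labels, and the prompt are examples I picked for illustration, not part of the original answer.

```python
from transformers import pipeline

# Zero-shot: no task-specific training samples at all, only candidate labels.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot(
    "This binary repeatedly rewrites registry keys and opens outbound sockets.",
    candidate_labels=["malicious behaviour", "benign behaviour"],
))

# Few-shot: a handful of labelled examples placed directly in the prompt.
few_shot_prompt = (
    "Text: free prize, click this link!!! -> spam\n"
    "Text: meeting moved to 3 pm -> not spam\n"
    "Text: claim your reward now ->"
)
generator = pipeline("text-generation", model="gpt2")
print(generator(few_shot_prompt, max_new_tokens=3)[0]["generated_text"])
```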
But the general rule of thumb is: the more data you have, the better, even for LLMs. So I would suggest you use all the data available to you to train your model.
Cheers!