Intuition Behind Model Training in General
When training a model on a dataset, we minimize the loss on that specific dataset. The optimizer adapts the trainable parameters (aka weights) to fit the patterns and structures found in the current dataset only, starting either from scratch (usually random values) or from the weights of an already trained model. It does not care about what the model 'learned' before; it just takes that 'knowledge' as a starting point and keeps only what is helpful and necessary to reduce the loss. In other words, any parameters that previously detected old patterns are reused to detect whatever new patterns reduce the loss on the new dataset. Note that the optimizer effectively does not 'know' which parameters encode old patterns; it only observes the loss (the difference between the generated output and the desired output from the dataset).
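To make this concrete, here is a minimal sketch of such a training loop in PyTorch. The tiny linear model, random data, and hyperparameters are placeholders for illustration; a pretrained model could be loaded in place of the random initialization:

```python
import torch
import torch.nn as nn

# Placeholder model and data; any architecture follows the same pattern.
model = nn.Linear(4, 2)                      # could instead load pretrained weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 4)                  # stand-in for a batch from the dataset
targets = torch.randn(32, 2)                 # the 'wanted output' of the dataset

for step in range(100):
    prediction = model(inputs)               # generated output
    loss = loss_fn(prediction, targets)      # generated output vs. wanted output
    optimizer.zero_grad()
    loss.backward()                          # gradients: how to change each weight
    optimizer.step()                         # adapt parameters to reduce the loss
```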
A typical model operates in a high-dimensional parameter space, but to visualize the process, let’s simplify:
Assume the model only has two parameters, X and Y — like a 2D coordinate system. Each dataset defines a unique loss surface (like hills and valleys in this space). For any given position (X, Y), the height (Z-axis) represents the loss.
Training is the process of moving across this surface, guided by the loss gradient, to find a local or global minimum (i.e., the best parameter values that reduce error). Typically, the model will follow a relatively straight path towards the nearest optimum.
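As a sketch, here is plain gradient descent on a made-up bowl-shaped loss surface with two parameters (the surface, starting point, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

# A toy 2D loss surface with a single minimum at (3, -1).
def loss(x, y):
    return (x - 3) ** 2 + (y + 1) ** 2

def grad(x, y):
    return np.array([2 * (x - 3), 2 * (y + 1)])

pos = np.array([0.0, 0.0])   # starting parameters (X, Y)
lr = 0.1
for step in range(50):
    pos -= lr * grad(*pos)   # move downhill along the gradient

print(pos, loss(*pos))       # close to (3, -1), loss near 0
```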
What Happens When Mixing Datasets?
If we mix Dataset 1 and Dataset 2, assuming equal probability sampling (e.g. both datasets are of equal size), the combined loss becomes: loss_combined(x, y) = 0.5 * loss1(x, y) + 0.5 * loss2(x, y). So now, the model isn't minimizing just loss1 or loss2, but the average loss of both datasets.
The new minimum is not necessarily in the middle between the two; it could, for example, lie at a point where the loss functions of Dataset 1 and Dataset 2 each have a local minimum.
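A small numeric sketch, using two made-up quadratic losses with different curvatures, shows this: the combined minimum can sit well off the straight line between the two individual optima:

```python
import numpy as np

# Two hypothetical quadratic loss surfaces with different curvatures.
def loss1(x, y):
    return (x - 0) ** 2 + 10 * (y - 0) ** 2      # minimum at (0, 0)

def loss2(x, y):
    return 10 * (x - 2) ** 2 + (y - 2) ** 2      # minimum at (2, 2)

def loss_combined(x, y):
    return 0.5 * loss1(x, y) + 0.5 * loss2(x, y)

# Grid search for the combined minimum.
xs = np.linspace(-1, 3, 401)
X, Y = np.meshgrid(xs, xs)
Z = loss_combined(X, Y)
i, j = np.unravel_index(Z.argmin(), Z.shape)
print(X[i, j], Y[i, j])   # about (1.82, 0.18): not on the line from (0,0) to (2,2)
```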
Note that the loss surface optimized during training is always the average loss over randomly sampled entries from the training dataset, evaluated at a given point in the multi-dimensional parameter space. This serves as an estimate of the true loss function, which we usually don't know; computing it would require all possible inputs and outputs. In practice we can only train on a subset of the often effectively infinite dataset (the one our inference data is drawn from), due to limited compute power and time.
The smaller the dataset, the less accurate this estimate becomes, resulting in a "noisier" and more irregular loss surface.
Conversely, the larger the dataset, the smoother and more representative the estimated loss surface becomes, more closely reflecting the true underlying loss landscape during training.
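A quick simulation, with a made-up per-example loss distribution, illustrates how the noise in the estimated loss shrinks with dataset size, roughly as 1/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate per-example losses with a true expected value of 1.0.
# The training signal is the *mean* over n sampled examples; the smaller
# n, the noisier that mean, i.e. the noisier the estimated loss surface.
for n in (10, 100, 10_000):
    estimates = [rng.exponential(1.0, size=n).mean() for _ in range(1000)]
    print(f"n={n:>6}: std of loss estimate = {np.std(estimates):.4f}")
```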
Why Sequential Finetuning Can Fail
Suppose we first train on Dataset 1 and reach the optimum at (ox1, oy1).
Then, we finetune on Dataset 2 and the new optimum is at (ox2, oy2).
But the true optimum of the combined loss is at (ox3, oy3).
The problem:
Unless (ox3, oy3) lies directly on the path between the first two optima, it's unlikely the model will reach that point during sequential training. Even early stopping based on a validation set won’t necessarily steer it towards the combined optimum.
This is why finetuning often mixes in data from the original dataset (sometimes called rehearsal or replay), to keep the model generalizing over both.
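Continuing the toy quadratic losses from above: gradient descent run sequentially (Dataset 1, then Dataset 2) ends up near Dataset 2's optimum, while descending the averaged gradient, which is effectively what mixing the datasets does, lands near the combined optimum:

```python
import numpy as np

# Gradients of the two quadratic losses from above:
# minima at (0, 0) and (2, 2), combined minimum near (1.82, 0.18).
def grad1(x, y): return np.array([2 * x, 20 * y])
def grad2(x, y): return np.array([20 * (x - 2), 2 * (y - 2)])

def descend(grad_fn, pos, lr=0.01, steps=2000):
    pos = pos.copy()
    for _ in range(steps):
        pos -= lr * grad_fn(*pos)
    return pos

start = np.array([1.0, -1.0])
m1 = descend(grad1, start)            # train on Dataset 1  -> ~(0, 0)
m2 = descend(grad2, m1)               # finetune on Dataset 2 -> ~(2, 2)
mix = descend(lambda x, y: 0.5 * grad1(x, y) + 0.5 * grad2(x, y), start)
print(m2)    # ends near Dataset 2's optimum, far from the combined one
print(mix)   # mixing lands near the combined optimum (1.82, 0.18)
```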
Answers to Your Questions
Q1: How does Model 2 compare to Model 1?
Model 2, trained on the combined dataset, will generally perform better on data resembling the combined distribution.
Model 1, finetuned only on Dataset 2, might perform better on Dataset 2-like data but worse overall.
There's also a risk that Model 1 gets stuck in a local minimum of Dataset 1, making it suboptimal for Dataset 2 — especially if the learning rate is too low.
Since training is stochastic, results can vary. Unless the loss function has a dominant global minimum, we often settle for good-enough local optima.
Q2: What about Catastrophic Forgetting and Overfitting?
Catastrophic Forgetting
Yes, this happens, especially if Dataset 2 is very different from Dataset 1.
Example:
Train a language model on Language A, then continue training on Language B — the model will likely "forget" Language A.
The more different the datasets are, the stronger this effect.
Overfitting
Related, but different.
If you monitor validation loss and stop training when it rises, overfitting is less likely.
Overfitting is mostly a risk when the dataset is small relative to model size.
Training on the combined dataset helps mitigate this, as it provides more diverse data.
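A minimal sketch of such validation-based early stopping; train_one_epoch and validation_loss are hypothetical callbacks supplied by your training code:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch()
        val = validation_loss()
        if val < best:
            best = val
            epochs_without_improvement = 0   # new best: keep training
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation loss keeps rising: stop
    return best
```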
On Data Distribution Shift
Differences in dataset composition can pull the model in different directions.
Example:
Dataset 1: 80% cats, 10% dogs, 10% birds
Dataset 2: 90% dogs, 10% cats, 0% birds
The model might learn shortcuts, e.g., always guessing "cat" for Dataset 1 and "dog" for Dataset 2, without really learning image features. This is often the cheapest change to the model parameters that achieves a lower average loss. Only after this adaptation does the model learn to further refine its estimate based on image features. That takes much longer, since backpropagation first adapts the last layers of the network, which have a much larger influence on the loss.
When birds disappear from Dataset 2, the model may "forget" that birds exist, i.e. the average probability of birds in the output will quickly drop towards 0% before the model adapts to the image features of Dataset 2. This would be catastrophic forgetting in action.
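A small numeric check of this shortcut effect, using the made-up class frequencies above and the expected cross-entropy of a constant, input-independent prediction:

```python
import numpy as np

# Hypothetical class frequencies (cat, dog, bird) in the two datasets.
p1 = np.array([0.8, 0.1, 0.1])
p2 = np.array([0.1, 0.9, 0.0])

def expected_cross_entropy(true_prior, constant_prediction):
    eps = 1e-12  # avoid log(0)
    return -(true_prior * np.log(constant_prediction + eps)).sum()

# For a predictor that ignores the image entirely, predicting the class
# prior itself is the loss-minimizing constant output:
print(expected_cross_entropy(p2, np.ones(3) / 3))  # uniform guess:      ~1.10
print(expected_cross_entropy(p2, p1))              # Dataset 1's prior:  ~2.09
print(expected_cross_entropy(p2, p2))              # Dataset 2's prior:  ~0.33
# Shifting the output bias from p1 towards p2 (bird probability -> 0) is a
# cheap loss reduction, available long before any image feature is re-learned.
```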
Final Thoughts
- Always try to train last on the data most similar to inference data.
- Training updates the model by following gradients (directions for parameter updates).
- Validation just ensures we don’t overtrain and start memorizing instead of generalizing.