
Hi, I'm just getting started with understanding transformer-based models, and I haven't been able to find out how the token embeddings are arrived at. There are multiple tokenization approaches, and multiple vocabularies/documents that LLMs are trained on, so my questions are:

  1. Does each LLM also train its own token embeddings?
  2. How do those pre-trained embeddings work for transfer learning or fine-tuning on custom datasets where some OOV words may be present, or where we have special unique tokens that we want to keep whole rather than have the tokenizer split them into subwords?
dasman

2 Answers


First, the token vocabulary is extracted from the training data, usually by means of byte-pair encoding (BPE), WordPiece, or unigram tokenization. Then, regarding model definition, the first layer in LLMs is the token embedding layer, which is trained together with the rest of the model.
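As a minimal sketch of how that first layer fits into a model, here is a toy PyTorch module (the vocabulary size and embedding dimension are arbitrary placeholders, and the transformer blocks are omitted):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # The embedding table is just another trainable weight matrix:
        # one d_model-sized vector per token id in the vocabulary.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # ... transformer blocks would go here ...
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        x = self.tok_emb(token_ids)   # (batch, seq_len, d_model)
        return self.lm_head(x)        # logits over the vocabulary

model = TinyLM()
# The embedding weights receive gradients like any other parameter,
# so they are learned jointly with the rest of the model.
print(model.tok_emb.weight.requires_grad)  # True
```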

Given that subword-level vocabularies mitigate the OOV problem (an unseen word is simply split into known subwords or individual bytes/characters), there are no unknown tokens, so usually there is no need to address such a problem. It is not common either to have special tokens that need to be kept unsplit.
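You can check this yourself by tokenizing a made-up word; a BPE tokenizer such as GPT-2's (used below purely as an example) falls back to known subword pieces instead of producing an unknown token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A nonsense word is split into known subword pieces rather than mapped to <unk>.
print(tokenizer.tokenize("flurbogastric"))
# Something like ['fl', 'urb', 'og', 'astric']; the exact split depends on the vocabulary.
```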

In general, it is inconvenient to modify the vocabulary of a pretrained model, because you need to retrain or fine-tune the model so that it can handle the new tokens, and therefore you lose the advantages of reusing the pretrained representations; furthermore, the fine-tuning data may not be enough to learn good representations for the new tokens.

noe
  1. Yes, each model has a different tokenizer type. You can see the list of the different types of tokenizers in the documentation for AutoTokenizer.
  2. You can use the same type of tokenizer as the 'flavor' of model your transfer learning is based on and train your own tokenizer; see the video Training a new tokenizer from an old one and the sketch below.
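As a sketch of point 2, Hugging Face fast tokenizers expose train_new_from_iterator, which keeps the old tokenizer's algorithm and preprocessing but learns a fresh vocabulary from your corpus (the corpus, base model, and vocabulary size below are placeholders):

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for your domain-specific corpus (any iterator of strings works).
corpus = ["first custom document ...", "second custom document ..."]

# Learns a new vocabulary of the requested size on the corpus while reusing
# the old tokenizer's normalization and pre-tokenization rules.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=32000)
new_tokenizer.save_pretrained("my-domain-tokenizer")
```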

Regarding OOV tokens, there is also a utility method that might be worth looking into, add_tokens, though it appears there are some nuances to be considered there.
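A minimal sketch of that workflow (the added token strings are invented, and gpt2 is just an example checkpoint): after adding tokens you also need to resize the model's embedding matrix, and the new rows start out randomly initialized, so they only become meaningful after fine-tuning.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register tokens that should be kept whole instead of being split into subwords.
num_added = tokenizer.add_tokens(["<PRODUCT_ID>", "MySpecialTerm"])
print(f"Added {num_added} tokens")

# Grow the embedding matrix so the new token ids have rows; these rows are
# randomly initialized and only become useful after fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```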