14

It could be that I'm misunderstanding the problem space and that the iterations of LLaMA, GPT, and PaLM are all based on BERT like many language models are, but every time I see a new paper on improving language models it takes BERT as a base and adds some kind of fine-tuning or filtering on top. I don't understand why BERT became the default in research circles when all anyone hears about publicly is GPT-2/3/4 or, more recently, LLaMA 2. I have a feeling it has something to do with BERT being open-source, but that can't be the whole story. This question might not be specific enough; please let me know. Thanks.

Ethan

3 Answers

13

There are many contributing factors to the abundance of research based on BERT vs the research based on Llama:

  • Age: BERT has been around for far longer than Llama (2018 vs. 2023), so it has more traction with researchers: it has been applied to many, many things, so people know it works, and it has probably already been applied to a problem similar to yours.
  • Computational resources: BERT is lightweight compared to Llama. Anyone can train BERT on a single medium-range GPU. To use Llama for inference you need a lot of very powerful GPUs, let alone training it. Most research groups have modest computational resources.
  • Appropriateness for downstream tasks: BERT is easily applied to text classification because it provides an output at the [CLS] token position, to which a classification head can be attached directly (see the sketch after this list). Llama is an autoregressive language model, which makes it less obvious how to use it for classification. Of course, you can approach the task at the natural-language level and simply ask Llama to classify the input text, but this kind of approach is not 100% reliable, and the model may just answer with a diatribe about why it's not OK to proceed with your request unless you explicitly constrain its output (e.g. with grammars). On the other hand, BERT is not meant as a generative model, so you'd better not use it for generative tasks.
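
As an illustration of that last point, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both just example choices), of how a classification head sits on top of BERT's pooled [CLS] representation:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# AutoModelForSequenceClassification adds a randomly initialized linear
# classification head on top of BERT's pooled [CLS] output.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)  # the head is untrained here, so the label is arbitrary
```

In practice you would fine-tune the whole model (or just the head) on labeled data, which is exactly the workflow that makes BERT convenient for classification research.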
noe
4

Adding to/complementing the other answers: BERT makes it possible to access the embeddings of the input it is fed (which wasn't, and still isn't, the case for some other models). These embeddings are often interpreted as the model's "perception" of the input. They are very useful in many research studies, and you will find many that reason about the embedding distances between different text inputs (e.g. the cosine distance) to detect or measure some aspect of the input text or of the model itself.

e.g. given the following sentences,

A = "the man earns 10 dollars",

B = "the woman earns 10 dollars",

C = "the man is broke" and

D = "the woman is broke"

when fed to a financial language model, we expect the embeddings of A and B to be close, and similarly those of C and D, indicating that the attention is focused on what each individual earns and not on their gender. Otherwise, if for instance B and D are very close, or B is closer to C and D than A is, we may suspect a gender bias in the model. A minimal sketch of this kind of comparison is shown below.
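
Such a comparison, assuming the Hugging Face transformers library and a generic bert-base-uncased checkpoint (a domain-specific financial model would be substituted in practice), might look like this:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = {
    "A": "the man earns 10 dollars",
    "B": "the woman earns 10 dollars",
    "C": "the man is broke",
    "D": "the woman is broke",
}

def embed(text):
    # Mean-pool the last hidden states into a single sentence vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)

emb = {name: embed(text) for name, text in sentences.items()}

# High A-B and C-D similarity (relative to the cross pairs) would suggest the
# model focuses on the financial content rather than on the gendered word.
print("A-B:", F.cosine_similarity(emb["A"], emb["B"], dim=0).item())
print("C-D:", F.cosine_similarity(emb["C"], emb["D"], dim=0).item())
print("B-D:", F.cosine_similarity(emb["B"], emb["D"], dim=0).item())
```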

Although LLaMA 2 was introduced very recently, we're already seeing a lot of research using it. However, some say that its embeddings can be "meaningless" (I don't agree with that, and you can read more on this topic here: https://stackoverflow.com/a/77441536/3014036). GPT gives access to its embeddings, but it is slow and expensive, so you can imagine that anybody running a large-scale study would use BERT instead. Overall, I agree with the main reason given in the accepted answer: BERT is lightweight, while LLaMA and GPT are both very expensive and much slower than BERT.

3

Although LLMs like GPT-3 and LLaMA have gained public attention thanks to marketing, BERT has served as a foundation for much LLM research, being open-source and one of the first widely adopted models based on the Transformer architecture.

Also, BERT's bidirectional, context-aware embeddings allow it to capture rich contextual information from both the left and right context of a word. This property made BERT versatile and effective across a wide range of NLP tasks (a small masked-word example is sketched below).
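
As a small illustration of that bidirectionality, a masked-word prediction uses the context on both sides of the gap. This is only a sketch, assuming the Hugging Face transformers library and an example sentence of my own:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Both the words before and after [MASK] constrain the prediction,
# which is what "bidirectional context" means in practice.
for candidate in fill("The bank raised its interest [MASK] last quarter."):
    print(candidate["token_str"], round(candidate["score"], 3))
```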

These are the reasons why researchers often use BERT instead of other LLMs.