Questions tagged [attention-mechanism]

160 questions
109 votes · 4 answers

What is the positional encoding in the transformer model?

I'm trying to read and understand the paper Attention Is All You Need, and in it there is a picture: I don't know what positional encoding is. By listening to some YouTube videos I've found out that it is an embedding having both meaning and…
Peyman • 1,235 • 2 • 9 • 8
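For readers skimming this question: a minimal NumPy sketch of the sinusoidal encoding defined in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len and d_model values below are arbitrary illustrations.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # One row per position, one column per embedding dimension.
    positions = np.arange(max_len)[:, None]                                    # (max_len, 1)
    inv_freq = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)   # 10000^(-2i/d_model)
    angles = positions * inv_freq                                              # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
print(pe.shape)   # (50, 64); each row gets added to the token embedding at that position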
41 votes · 8 answers

In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?

While reviewing the Transformer architecture, I realized something I didn't expect, which is that the positional encoding is summed with the word embeddings rather than concatenated to…
FremyCompany • 523 • 1 • 4 • 7
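To make the contrast in the question concrete, a toy comparison (random arrays stand in for real embeddings and encodings):

import numpy as np

seq_len, d_model = 4, 8
token_emb = np.random.randn(seq_len, d_model)    # word embeddings
pos_enc = np.random.randn(seq_len, d_model)      # positional encodings of the same width

summed = token_emb + pos_enc                                   # what the Transformer does
concatenated = np.concatenate([token_emb, pos_enc], axis=-1)   # the alternative the question asks about

print(summed.shape)         # (4, 8)  -- model width unchanged
print(concatenated.shape)   # (4, 16) -- every downstream weight matrix would have to grow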
40 votes · 3 answers

What's the difference between Attention and Self-Attention? What problems does each solve that the other can't?

As stated in the question above: is there a difference between the attention and self-attention mechanisms? Additionally, can anybody share tips and tricks about how a self-attention mechanism can be implemented in a CNN?
Pratik.S • 533 • 1 • 5 • 10
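A minimal scaled dot-product attention sketch that shows the distinction asked about here: "self-attention" just means queries, keys, and values all come from the same sequence, while attention in general can take them from different sequences (shapes below are arbitrary):

import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q: (n_queries, d), k and v: (n_keys, d)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 16)   # one sequence
y = torch.randn(7, 16)   # another sequence (e.g. an encoder's output)

self_attn = attention(x, x, x)    # self-attention: everything comes from x
cross_attn = attention(x, y, y)   # (cross-)attention: queries from x, keys/values from y
print(self_attn.shape, cross_attn.shape)   # both (5, 16)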
36 votes · 4 answers

Gumbel-Softmax trick vs Softmax with temperature

From what I understand, the Gumbel-Softmax trick is a technique that enables us to sample discrete random variables, in a way that is differentiable (and therefore suited for end-to-end deep learning). Many papers and articles describe it as a way…
4-bit • 461 • 1 • 4 • 3
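A quick sketch of the two operations being compared, under the usual formulation: temperature softmax is deterministic given the logits, while Gumbel-Softmax perturbs the logits with Gumbel(0, 1) noise first, so each forward pass is a differentiable sample:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])
tau = 0.5   # temperature

# Softmax with temperature: identical output on every call.
temp_softmax = F.softmax(logits / tau, dim=-1)

# Gumbel-Softmax: add Gumbel(0, 1) noise, then apply the same temperature softmax.
u = torch.rand_like(logits).clamp_min(1e-9)
gumbel_noise = -torch.log(-torch.log(u))
gumbel_softmax = F.softmax((logits + gumbel_noise) / tau, dim=-1)

print(temp_softmax)
print(gumbel_softmax)   # changes from run to run
# PyTorch ships the latter as F.gumbel_softmax(logits, tau=tau, hard=False).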
29 votes · 7 answers

Why is the decoder not a part of the BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including Transformers and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do…
hathalye7 • 445 • 1 • 5 • 7
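For context: BERT is encoder-only, and token predictions come from a head applied directly to the encoder output at each position. A toy sketch of that pattern (layer sizes are illustrative assumptions, not BERT's actual configuration):

import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 10

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # encoder-only stack, no decoder
mlm_head = nn.Linear(d_model, vocab_size)              # predicts a vocabulary token per position

hidden = encoder(torch.randn(1, seq_len, d_model))     # (1, seq_len, d_model)
logits = mlm_head(hidden)                              # (1, seq_len, vocab_size)
print(logits.argmax(dim=-1))                           # token ids predicted at each (e.g. masked) position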
21 votes · 1 answer

Can BERT do the next-word-prediction task?

As BERT is bidirectional (it uses a bidirectional Transformer), is it possible to use it for the next-word-prediction task? If yes, what needs to be tweaked?
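One way to approximate next-word prediction with a masked language model is to append a [MASK] token and let the model fill it; a sketch with the Hugging Face transformers library (bert-base-uncased is just a commonly available checkpoint, and BERT is not trained as a left-to-right language model, so treat this as an approximation):

from transformers import pipeline

# Phrase next-word prediction as filling a [MASK] appended to the text.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], candidate["score"])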
17 votes · 3 answers

How does attention mechanism learn?

I know how to build an attention mechanism in neural networks, but I don't understand how attention layers learn the weights that pay attention to some specific embedding. I have this question because I'm tackling an NLP task using an attention layer. I believe…
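A minimal sketch of why the weights are learnable: the attention distribution is a softmax over scores produced by learned projection matrices, so the task loss back-propagates into those matrices just like into any other layer (the projections below are generic stand-ins, not a specific paper's parametrization):

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 16
W_q, W_k, W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)   # learned projections

x = torch.randn(5, d)                       # a sequence of 5 embeddings
scores = W_q(x) @ W_k(x).T / d ** 0.5
attn = F.softmax(scores, dim=-1)            # the attention weights over positions
out = attn @ W_v(x)

loss = out.sum()                            # stand-in for a real task loss
loss.backward()
print(W_q.weight.grad.abs().mean())         # non-zero: the parameters that shape attention receive gradients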
15 votes · 2 answers

Class token in ViT and BERT

I'm trying to understand the architecture in the ViT paper, and noticed they use a CLASS token like in BERT. To the best of my understanding, this token is used to gather knowledge of the entire input, and is then solely used to predict the class of…
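A sketch of the pattern being asked about: a learnable [CLS] vector is prepended to the token (or patch) embeddings, travels through the encoder with everything else, and only its final hidden state feeds the classification head (all sizes below are illustrative):

import torch
import torch.nn as nn

d_model, n_tokens, n_classes = 64, 16, 10

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))    # learnable [CLS] embedding
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, n_classes)

tokens = torch.randn(1, n_tokens, d_model)       # word or patch embeddings
x = torch.cat([cls_token, tokens], dim=1)        # prepend [CLS]; sequence length becomes n_tokens + 1
logits = head(encoder(x)[:, 0])                  # classify from the [CLS] position only
print(logits.shape)                              # (1, n_classes)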
13 votes · 2 answers

Why does the transformer positional encoding use both sine and cosine?

In the Transformer architecture they use positional encoding (explained in this answer), and I get how it is constructed. I am wondering why it needs to use both sine and cosine, though, instead of just one or the other?
Joff • 263 • 2 • 6
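One property usually cited, shown numerically: with a sine/cosine pair at the same frequency, the encoding at position pos + k is a fixed rotation of the encoding at pos (the rotation depends only on k), which a single sine could not provide. The frequency and offset below are arbitrary:

import numpy as np

w = 1.0 / 10000 ** (2 * 3 / 64)    # one frequency from the PE formula (i = 3, d_model = 64)
k = 5                              # a fixed positional offset

def pe_pair(pos):                  # the (sin, cos) pair at this frequency
    return np.array([np.sin(pos * w), np.cos(pos * w)])

# Rotation matrix that depends only on the offset k, not on the position.
R = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])

for pos in [0, 7, 42]:
    assert np.allclose(R @ pe_pair(pos), pe_pair(pos + k))
print("PE(pos + k) is a linear transform of PE(pos), independent of pos")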
13 votes · 2 answers

Variable input/output length for Transformer

I was reading the paper "Attention is all you need" (https://arxiv.org/pdf/1706.03762.pdf) and came across this site http://jalammar.github.io/illustrated-transformer/, which provided a great breakdown of the architecture of the Transformer.…
Sean Lee • 251 • 2 • 8
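A quick illustration of the point: nothing in an attention layer has a dimension tied to sequence length, so the same weights process inputs of any length (padding masks, needed when batching different lengths together, are omitted here):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

for seq_len in (3, 17, 50):                 # three different input lengths, same module
    x = torch.randn(1, seq_len, 32)
    out, _ = attn(x, x, x)
    print(seq_len, tuple(out.shape))        # (1, seq_len, 32)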
12 votes · 4 answers

Transformer model: Why are word embeddings scaled before adding positional encodings?

While going over a TensorFlow tutorial for the Transformer model, I realized that their implementation of the Encoder layer (and the Decoder) scales word embeddings by the square root of the embedding dimension before adding positional encodings. Notice that this…
Milad Shahidi • 413 • 4 • 9
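The operation in question, isolated (PyTorch used here rather than the tutorial's TensorFlow; the embedding table and the random stand-in for positional encodings are illustrative):

import math
import torch
import torch.nn as nn

vocab, d_model, seq_len = 1000, 64, 10
embed = nn.Embedding(vocab, d_model)
pos_enc = torch.randn(seq_len, d_model)        # stand-in for the sinusoidal positional encodings

tokens = torch.randint(0, vocab, (1, seq_len))
x = embed(tokens) * math.sqrt(d_model)         # scale the embeddings up by sqrt(d_model) ...
x = x + pos_enc                                # ... before the positional encodings are added
print(x.shape)                                 # (1, 10, 64)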
11 votes · 2 answers

How do attention mechanisms in RNNs learn weights for a variable length input

Attention mechanisms in RNNs are reasonably common in sequence-to-sequence models. I understand that the decoder learns a weight vector $\alpha$ that is used to take a weighted sum of the output vectors from the encoder network. This is used to…
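A sketch of that weighted sum in additive (Bahdanau-style) form; the point about variable length is that the learned pieces (named W, U, v below purely for illustration) are fixed-size, so the same parameters can score any number of encoder steps:

import torch
import torch.nn as nn
import torch.nn.functional as F

enc_dim, dec_dim, attn_dim = 32, 32, 16
W, U, v = nn.Linear(enc_dim, attn_dim), nn.Linear(dec_dim, attn_dim), nn.Linear(attn_dim, 1)

T = 9                                      # any encoder length works
enc_outputs = torch.randn(T, enc_dim)      # encoder hidden states h_1 .. h_T
dec_state = torch.randn(dec_dim)           # current decoder state

scores = v(torch.tanh(W(enc_outputs) + U(dec_state))).squeeze(-1)   # one score per encoder step
alpha = F.softmax(scores, dim=-1)          # the attention weights, length T
context = alpha @ enc_outputs              # weighted sum fed to the decoder
print(alpha.shape, context.shape)          # torch.Size([9]) torch.Size([32])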
10 votes · 4 answers

How are Q, K, and V Vectors Trained in a Transformer Self-Attention?

I am new to transformers, so this may be a silly question, but I was reading about transformers and how they use attention, and it involves the usage of three special vectors. Most articles say that one will understand their purpose after reading…
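A quick way to see that Q, K, and V are not stored "special vectors" but the outputs of learned projections: inspect a stock attention module's parameters; the projection matrices are ordinary weights updated by backpropagation.

import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=2)
for name, p in mha.named_parameters():
    print(name, tuple(p.shape))
# in_proj_weight has shape (48, 16): the stacked W_Q, W_K, W_V projections
# (one 16x16 block each), trained like any other layer's weights.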
9 votes · 2 answers

Cross-attention mask in Transformers

I can't fully understand how we should create the decoder's cross-attention mask in the original Transformer model from Attention Is All You Need. Here is my attempt at finding a solution: Suppose we are training such a Transformer model,…
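As I understand the original setup (treat this as an assumption rather than a definitive answer): in the decoder's cross-attention the keys and values are the encoder outputs, so the only mask needed there is the source padding mask; the causal mask belongs to the decoder's self-attention. A sketch using PyTorch's key_padding_mask:

import torch
import torch.nn as nn

d_model, src_len, tgt_len = 32, 6, 4
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

tgt = torch.randn(1, tgt_len, d_model)       # decoder states (queries)
memory = torch.randn(1, src_len, d_model)    # encoder outputs (keys and values)
src_pad = torch.tensor([[False, False, False, False, True, True]])   # last two source tokens are padding

out, weights = cross_attn(tgt, memory, memory, key_padding_mask=src_pad)
print(weights[0, :, -2:])   # attention on the padded source positions is zero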
7 votes · 2 answers

Transformer-based architectures for regression tasks

As far as I've seen, transformer-based architectures are always trained with classification tasks (one-hot text tokens for example). Are you aware of any architectures using attention and solving regression tasks? Could one build a regressive…
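Nothing in the attention blocks themselves is classification-specific; one hypothetical way to get a regressor (a sketch under that assumption, not a reference architecture) is to pool the encoder output and train a linear head with MSE:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 32
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, 1)                     # one continuous output

x = torch.randn(8, 20, d_model)                  # batch of 8 sequences of continuous features
target = torch.randn(8, 1)

pred = head(encoder(x).mean(dim=1))              # mean-pool over time, then regress
loss = F.mse_loss(pred, target)
loss.backward()
print(loss.item())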