Questions tagged [attention-mechanism]

160 questions
109 votes · 4 answers

What is the positional encoding in the transformer model?

I'm trying to read and understand the paper Attention Is All You Need, and in it there is a picture: I don't know what positional encoding is. By listening to some YouTube videos I've found out that it is an embedding having both meaning and…
Peyman • 1,235 • 2 • 9 • 8
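For readers skimming this question: a minimal NumPy sketch of the sinusoidal encoding defined in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len and d_model values below are arbitrary illustrations.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # One row per position, one column per embedding dimension.
    positions = np.arange(max_len)[:, None]                                    # (max_len, 1)
    inv_freq = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)   # 10000^(-2i/d_model)
    angles = positions * inv_freq                                              # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
print(pe.shape)   # (50, 64); each row gets added to the token embedding at that position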
41 votes · 8 answers

In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?

While reviewing the Transformer architecture, I realized something I didn't expect, which is that the positional encoding is summed with the word embeddings rather than concatenated to…
FremyCompany • 523 • 1 • 4 • 7
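To make the contrast in the question concrete, a toy comparison (random arrays stand in for real embeddings and encodings):

import numpy as np

seq_len, d_model = 4, 8
token_emb = np.random.randn(seq_len, d_model)    # word embeddings
pos_enc = np.random.randn(seq_len, d_model)      # positional encodings of the same width

summed = token_emb + pos_enc                                   # what the Transformer does
concatenated = np.concatenate([token_emb, pos_enc], axis=-1)   # the alternative the question asks about

print(summed.shape)         # (4, 8)  -- model width unchanged
print(concatenated.shape)   # (4, 16) -- every downstream weight matrix would have to grow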
40 votes · 3 answers

What's the difference between Attention and Self-Attention? What problems does each solve that the other can't?

As stated in the question above: is there a difference between the attention and self-attention mechanisms? Additionally, can anybody share tips and tricks about how a self-attention mechanism can be implemented in a CNN?
Pratik.S • 533 • 1 • 5 • 10
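A minimal scaled dot-product attention sketch that shows the distinction asked about here: "self-attention" just means queries, keys, and values all come from the same sequence, while attention in general can take them from different sequences (shapes below are arbitrary):

import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q: (n_queries, d), k and v: (n_keys, d)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 16)   # one sequence
y = torch.randn(7, 16)   # another sequence (e.g. an encoder's output)

self_attn = attention(x, x, x)    # self-attention: everything comes from x
cross_attn = attention(x, y, y)   # (cross-)attention: queries from x, keys/values from y
print(self_attn.shape, cross_attn.shape)   # both (5, 16)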
36 votes · 4 answers

Gumbel-Softmax trick vs Softmax with temperature

From what I understand, the Gumbel-Softmax trick is a technique that enables us to sample discrete random variables, in a way that is differentiable (and therefore suited for end-to-end deep learning). Many papers and articles describe it as a way…
4-bit • 461 • 1 • 4 • 3
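A quick sketch of the two operations being compared, under the usual formulation: temperature softmax is deterministic given the logits, while Gumbel-Softmax perturbs the logits with Gumbel(0, 1) noise first, so each forward pass is a differentiable sample:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])
tau = 0.5   # temperature

# Softmax with temperature: identical output on every call.
temp_softmax = F.softmax(logits / tau, dim=-1)

# Gumbel-Softmax: add Gumbel(0, 1) noise, then apply the same temperature softmax.
u = torch.rand_like(logits).clamp_min(1e-9)
gumbel_noise = -torch.log(-torch.log(u))
gumbel_softmax = F.softmax((logits + gumbel_noise) / tau, dim=-1)

print(temp_softmax)
print(gumbel_softmax)   # changes from run to run
# PyTorch ships the latter as F.gumbel_softmax(logits, tau=tau, hard=False).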
29 votes · 7 answers

Why is the decoder not a part of the BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including Transformers and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do…
hathalye7 • 445 • 1 • 5 • 7
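For context: BERT is encoder-only, and token predictions come from a head applied directly to the encoder output at each position. A toy sketch of that pattern (layer sizes are illustrative assumptions, not BERT's actual configuration):

import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 10

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # encoder-only stack, no decoder
mlm_head = nn.Linear(d_model, vocab_size)              # predicts a vocabulary token per position

hidden = encoder(torch.randn(1, seq_len, d_model))     # (1, seq_len, d_model)
logits = mlm_head(hidden)                              # (1, seq_len, vocab_size)
print(logits.argmax(dim=-1))                           # token ids predicted at each (e.g. masked) position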
21 votes · 1 answer

Can BERT do the next-word-prediction task?

As BERT is bidirectional (it uses a bidirectional Transformer), is it possible to use it for the next-word-prediction task? If yes, what needs to be tweaked?
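One way to approximate next-word prediction with a masked language model is to append a [MASK] token and let the model fill it; a sketch with the Hugging Face transformers library (bert-base-uncased is just a commonly available checkpoint, and BERT is not trained as a left-to-right language model, so treat this as an approximation):

from transformers import pipeline

# Phrase next-word prediction as filling a [MASK] appended to the text.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], candidate["score"])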
17 votes · 3 answers

How does attention mechanism learn?

I know how to build an attention mechanism in neural networks, but I don't understand how attention layers learn the weights that pay attention to some specific embedding. I have this question because I'm tackling an NLP task using an attention layer. I believe…
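A minimal sketch of why the weights are learnable: the attention distribution is a softmax over scores produced by learned projection matrices, so the task loss back-propagates into those matrices just like into any other layer (the projections below are generic stand-ins, not a specific paper's parametrization):

import torch
import torch.nn as nn
import torch.nn.functional as F

d = 16
W_q, W_k, W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)   # learned projections

x = torch.randn(5, d)                       # a sequence of 5 embeddings
scores = W_q(x) @ W_k(x).T / d ** 0.5
attn = F.softmax(scores, dim=-1)            # the attention weights over positions
out = attn @ W_v(x)

loss = out.sum()                            # stand-in for a real task loss
loss.backward()
print(W_q.weight.grad.abs().mean())         # non-zero: the parameters that shape attention receive gradients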
15 votes · 2 answers

Class token in ViT and BERT

I'm trying to understand the architecture in the ViT paper, and noticed they use a CLASS token like in BERT. To the best of my understanding, this token is used to gather knowledge of the entire input, and is then solely used to predict the class of…
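A sketch of the pattern being asked about: a learnable [CLS] vector is prepended to the token (or patch) embeddings, travels through the encoder with everything else, and only its final hidden state feeds the classification head (all sizes below are illustrative):

import torch
import torch.nn as nn

d_model, n_tokens, n_classes = 64, 16, 10

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))    # learnable [CLS] embedding
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, n_classes)

tokens = torch.randn(1, n_tokens, d_model)       # word or patch embeddings
x = torch.cat([cls_token, tokens], dim=1)        # prepend [CLS]; sequence length becomes n_tokens + 1
logits = head(encoder(x)[:, 0])                  # classify from the [CLS] position only
print(logits.shape)                              # (1, n_classes)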
13 votes · 2 answers

Why does the transformer positional encoding use both sine and cosine?

In the Transformer architecture they use positional encoding (explained in this answer), and I get how it is constructed. I am wondering why it needs to use both sine and cosine, though, instead of just one or the other?
Joff • 263 • 2 • 6
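One property usually cited, shown numerically: with a sine/cosine pair at the same frequency, the encoding at position pos + k is a fixed rotation of the encoding at pos (the rotation depends only on k), which a single sine could not provide. The frequency and offset below are arbitrary:

import numpy as np

w = 1.0 / 10000 ** (2 * 3 / 64)    # one frequency from the PE formula (i = 3, d_model = 64)
k = 5                              # a fixed positional offset

def pe_pair(pos):                  # the (sin, cos) pair at this frequency
    return np.array([np.sin(pos * w), np.cos(pos * w)])

# Rotation matrix that depends only on the offset k, not on the position.
R = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])

for pos in [0, 7, 42]:
    assert np.allclose(R @ pe_pair(pos), pe_pair(pos + k))
print("PE(pos + k) is a linear transform of PE(pos), independent of pos")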
13 votes · 2 answers

Variable input/output length for Transformer

I was reading the paper "Attention is all you need" (https://arxiv.org/pdf/1706.03762.pdf) and came across this site http://jalammar.github.io/illustrated-transformer/, which provided a great breakdown of the architecture of the Transformer.…
Sean Lee • 251 • 2 • 8
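A quick illustration of the point: nothing in an attention layer has a dimension tied to sequence length, so the same weights process inputs of any length (padding masks, needed when batching different lengths together, are omitted here):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

for seq_len in (3, 17, 50):                 # three different input lengths, same module
    x = torch.randn(1, seq_len, 32)
    out, _ = attn(x, x, x)
    print(seq_len, tuple(out.shape))        # (1, seq_len, 32)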
12 votes · 4 answers

Transformer model: Why are word embeddings scaled before adding positional encodings?

While going over a TensorFlow tutorial for the Transformer model, I realized that their implementation of the Encoder layer (and the Decoder) scales word embeddings by the square root of the embedding dimension before adding positional encodings. Notice that this…
Milad Shahidi • 413 • 4 • 9
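The operation in question, isolated (PyTorch used here rather than the tutorial's TensorFlow; the embedding table and the random stand-in for positional encodings are illustrative):

import math
import torch
import torch.nn as nn

vocab, d_model, seq_len = 1000, 64, 10
embed = nn.Embedding(vocab, d_model)
pos_enc = torch.randn(seq_len, d_model)        # stand-in for the sinusoidal positional encodings

tokens = torch.randint(0, vocab, (1, seq_len))
x = embed(tokens) * math.sqrt(d_model)         # scale the embeddings up by sqrt(d_model) ...
x = x + pos_enc                                # ... before the positional encodings are added
print(x.shape)                                 # (1, 10, 64)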
11 votes · 2 answers

How do attention mechanisms in RNNs learn weights for a variable length input

Attention mechanisms in RNNs are reasonably common in sequence-to-sequence models. I understand that the decoder learns a weight vector $\alpha$ that is used to take a weighted sum of the output vectors from the encoder network. This is used to…
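A sketch of that weighted sum in additive (Bahdanau-style) form; the point about variable length is that the learned pieces (named W, U, v below purely for illustration) are fixed-size, so the same parameters can score any number of encoder steps:

import torch
import torch.nn as nn
import torch.nn.functional as F

enc_dim, dec_dim, attn_dim = 32, 32, 16
W, U, v = nn.Linear(enc_dim, attn_dim), nn.Linear(dec_dim, attn_dim), nn.Linear(attn_dim, 1)

T = 9                                      # any encoder length works
enc_outputs = torch.randn(T, enc_dim)      # encoder hidden states h_1 .. h_T
dec_state = torch.randn(dec_dim)           # current decoder state

scores = v(torch.tanh(W(enc_outputs) + U(dec_state))).squeeze(-1)   # one score per encoder step
alpha = F.softmax(scores, dim=-1)          # the attention weights, length T
context = alpha @ enc_outputs              # weighted sum fed to the decoder
print(alpha.shape, context.shape)          # torch.Size([9]) torch.Size([32])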
10 votes · 4 answers

How are Q, K, and V Vectors Trained in a Transformer Self-Attention?

I am new to transformers, so this may be a silly question, but I was reading about transformers and how they use attention, and it involves the usage of three special vectors. Most articles say that one will understand their purpose after reading…
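A quick way to see that Q, K, and V are not stored "special vectors" but the outputs of learned projections: inspect a stock attention module's parameters; the projection matrices are ordinary weights updated by backpropagation.

import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=2)
for name, p in mha.named_parameters():
    print(name, tuple(p.shape))
# in_proj_weight has shape (48, 16): the stacked W_Q, W_K, W_V projections
# (one 16x16 block each), trained like any other layer's weights.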
9 votes · 2 answers

Cross-attention mask in Transformers

I can't fully understand how we should create the decoder's cross-attention mask in the original Transformer model from Attention Is All You Need. Here is my attempt at finding a solution: Suppose we are training such a Transformer model,…
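As I understand the original setup (treat this as an assumption rather than a definitive answer): in the decoder's cross-attention the keys and values are the encoder outputs, so the only mask needed there is the source padding mask; the causal mask belongs to the decoder's self-attention. A sketch using PyTorch's key_padding_mask:

import torch
import torch.nn as nn

d_model, src_len, tgt_len = 32, 6, 4
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

tgt = torch.randn(1, tgt_len, d_model)       # decoder states (queries)
memory = torch.randn(1, src_len, d_model)    # encoder outputs (keys and values)
src_pad = torch.tensor([[False, False, False, False, True, True]])   # last two source tokens are padding

out, weights = cross_attn(tgt, memory, memory, key_padding_mask=src_pad)
print(weights[0, :, -2:])   # attention on the padded source positions is zero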
7 votes · 2 answers

Transformer-based architectures for regression tasks

As far as I've seen, transformer-based architectures are always trained with classification tasks (one-hot text tokens for example). Are you aware of any architectures using attention and solving regression tasks? Could one build a regressive…
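Nothing in the attention blocks themselves is classification-specific; one hypothetical way to get a regressor (a sketch under that assumption, not a reference architecture) is to pool the encoder output and train a linear head with MSE:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 32
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, 1)                     # one continuous output

x = torch.randn(8, 20, d_model)                  # batch of 8 sequences of continuous features
target = torch.randn(8, 1)

pred = head(encoder(x).mean(dim=1))              # mean-pool over time, then regress
loss = F.mse_loss(pred, target)
loss.backward()
print(loss.item())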