Questions tagged [masking]
9 questions
9
votes
2 answers
Cross-attention mask in Transformers
I can't fully understand how we should create the decoder's cross-attention mask in the original Transformer model from Attention Is All You Need.
Here is my attempt at finding a solution:
Suppose we are training such a Transformer model,…
ИванКарамазов
- 230
- 2
- 9
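For the question above, a minimal sketch of how the cross-attention mask is usually built in an Attention-Is-All-You-Need-style model, assuming made-up shapes and a padding id of 0: only source padding is masked, since cross-attention has no causal structure between decoder queries and encoder keys.

import torch

# Hypothetical batch of 2 source sequences, length 5, with 0 as the padding id.
src = torch.tensor([[ 7, 12,  3,  0,  0],
                    [ 5,  9, 11,  8,  2]])
tgt_len, src_len = 4, src.size(1)

# In cross-attention the queries come from the decoder and the keys/values
# from the encoder, so the only thing to hide is padding in the source.
src_pad_mask = (src == 0)                            # (batch, src_len), True = ignore

# Broadcast to the (batch, tgt_len, src_len) shape of the attention scores.
cross_attn_mask = src_pad_mask.unsqueeze(1).expand(-1, tgt_len, -1)
print(cross_attn_mask.shape)                         # torch.Size([2, 4, 5])

With nn.MultiheadAttention the same information would normally be passed as key_padding_mask instead of being expanded by hand.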
7
votes
1 answer
Masking during transformer inference?
I sent the following question to both ChatGPT and Deepseek.
"Let's say we're not training the large language model, we are inferencing. The model already generated a sequence [A,B,C] and is about to predict next token D. The model needs to perform…
OnCodeDeny
- 71
- 3
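A small sketch of the situation described in the quoted prompt, under my own assumptions (greedy decoding with a KV cache and a single new query): when only the next position is being scored, every cached key lies in the past, so there is nothing left for a causal mask to hide.

import torch
import torch.nn.functional as F

d = 8
torch.manual_seed(0)

# Hypothetical cached keys/values for the already generated tokens [A, B, C].
cached_k = torch.randn(3, d)
cached_v = torch.randn(3, d)

# Query for the position that will produce the next token D.
q = torch.randn(1, d)

# This single query only sees past tokens, so there is nothing "in the
# future" to hide and no mask is needed for this decoding step.
scores = q @ cached_k.T / d**0.5            # (1, 3)
attn = F.softmax(scores, dim=-1)
context = attn @ cached_v                   # (1, d)
print(context.shape)                        # torch.Size([1, 8])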
2
votes
1 answer
Decoder Transformer feedforward
I have a question about the decoder transformer's feed-forward pass during training.
Let's pick an example: the input is "i love the sun" and the translation I want to predict (the Italian translation) is "io amo il sole".
Now I feed the encoder with the input "i love the…
erre4
- 95
- 7
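A sketch of the teacher-forcing pass this question is about, with made-up token ids and sizes: the whole shifted target ("io amo il sole" preceded by a start token) is fed to the decoder in one pass, and the causal tgt_mask, not step-by-step feeding, is what stops each position from looking ahead.

import torch
import torch.nn as nn

# Hypothetical token ids: <bos>=1, io=2, amo=3, il=4, sole=5, <eos>=6.
tgt_out = torch.tensor([[2, 3, 4, 5, 6]])        # labels the decoder should predict
tgt_in = torch.tensor([[1, 2, 3, 4, 5]])         # the same target, shifted right

seq_len = tgt_in.size(1)
# Causal mask: True above the diagonal means "may not attend to that position".
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

decoder_layer = nn.TransformerDecoderLayer(d_model=16, nhead=2, batch_first=True)
memory = torch.randn(1, 4, 16)                   # stand-in for the encoder output of "i love the sun"
emb = nn.Embedding(10, 16)
out = decoder_layer(emb(tgt_in), memory, tgt_mask=causal_mask)
print(out.shape)                                 # torch.Size([1, 5, 16])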
2
votes
1 answer
Why shouldn't we mask [CLS] and [SEP] in preparing inputs for an MLM?
I know that an MLM is trained to predict the index of the [MASK] token in the vocabulary list, and I also know that [CLS] stands for the beginning of the sentence and [SEP] tells the model that the sentence has ended or that another sentence will come soon, but…
Jie
- 21
- 1
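For the question above, a rough sketch of the usual answer in code, with hypothetical BERT-style token ids: special tokens are simply excluded from the pool of positions that the ~15% masking is sampled from, because predicting [CLS] or [SEP] teaches the model nothing about language.

import torch

# Hypothetical BERT-style ids: [CLS]=101, [SEP]=102, [PAD]=0, rest are word pieces.
input_ids = torch.tensor([[101, 2023, 2003, 1037, 7953, 102, 0, 0]])
special_ids = torch.tensor([101, 102, 0])

# Positions eligible for MLM masking: everything except special tokens / padding.
candidate = ~torch.isin(input_ids, special_ids)

# Sample ~15% of the candidate positions to be masked.
probs = torch.full(input_ids.shape, 0.15) * candidate
mlm_mask = torch.bernoulli(probs).bool()

labels = input_ids.clone()
labels[~mlm_mask] = -100          # positions ignored by the loss
masked_inputs = input_ids.clone()
masked_inputs[mlm_mask] = 103     # [MASK] id in BERT's vocabulary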
1
vote
1 answer
Dealing with high-frequency tokens during masked language modelling?
Suppose I am working with a Masked Language Model to pre-train on a specific dataset. In that dataset, most sequences contain a particular token at a very high frequency.
Sample sequence:
<tok1>, <tok2>, <tok3>, <tok4>, <tok4>, <tok4> ---> here tok4 is very…
neel g
- 227
- 1
- 5
- 11
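One option that answers to questions like this tend to suggest, sketched with invented counts and a made-up threshold t: down-weight the masking probability of the over-represented token, in the spirit of word2vec subsampling, so the MLM loss is not dominated by a token that is trivial to predict.

import torch

# Hypothetical corpus frequencies; "tok4" is the over-represented token.
vocab = ["tok1", "tok2", "tok3", "tok4"]
counts = torch.tensor([100., 120., 90., 5000.])
freqs = counts / counts.sum()

# Word2vec-style subsampling weight: the further a token's frequency is
# above the threshold t, the smaller its weight (clamped to at most 1).
t = 0.02
weight = torch.clamp(torch.sqrt(t / freqs), max=1.0)

# Scale the usual 15% masking rate per token id by that weight, so the
# dominant token is masked (and therefore predicted) far less often.
base_rate = 0.15
mask_rate_per_id = base_rate * weight
print(dict(zip(vocab, mask_rate_per_id.tolist())))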
1
vote
1 answer
Anonymize continuous variable for masking purposes
I am about to kick off a large hackathon event.
We have a dataset comprising one continuous variable with high precision, and a number of categorical variables qualifying these data three levels deep.
The data provider wants to 'mask' the data…
HEITZ
- 911
- 4
- 7
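A sketch of one way a requirement like this is often met, under my own assumptions about the column: a rank-based (quantile) transform plus a little jitter preserves ordering and rough distributional shape for the hackathon participants while discarding the original high-precision values.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical high-precision continuous measurements.
values = rng.lognormal(mean=3.0, sigma=1.2, size=1000)

# Rank-based masking: replace each value by its empirical quantile (0..1).
# Order and relative comparisons survive; the raw magnitudes do not.
ranks = values.argsort().argsort()
quantiles = (ranks + 0.5) / len(values)

# Optionally add a little jitter so exact ranks cannot be inverted either.
masked = np.clip(quantiles + rng.normal(0, 0.01, size=len(values)), 0, 1)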
1
vote
0 answers
PyTorch Transformer only generating NaN when using a mask
When I generate a src_mask like this
mask = torch.triu(
    torch.ones(batch_size, batch_size).bool(),
    diagonal=0
)
>> tensor([[ True,  True,  True,  True,  True],
           [False,  True,  True,  True,  True],
           [False, False, …
kot
- 11
- 1
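For reference, a sketch of the fix usually suggested for this, assuming the standard nn.Transformer convention where True means "this position may not be attended to": build the mask as (seq_len, seq_len) with diagonal=1, so the main diagonal stays False and no row is entirely masked (a fully masked row is exactly what makes the softmax return NaN).

import torch
import torch.nn as nn

seq_len = 5

# diagonal=1 keeps the main diagonal False, so every position can attend to
# itself and to earlier positions; no row of the mask is all True.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask[0])        # tensor([False,  True,  True,  True,  True])

# Recent PyTorch versions also expose a helper that returns an equivalent
# additive float mask (0.0 / -inf):
alt_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)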
0
votes
1 answer
Could there be a problem with the linear layer after the attention inside a transformer?
My question regards this image:
It seems that after the multi-head attention there is a linear layer, as they also mention here:
The linearity is given by the weights W^{o}. My question is: for the decoder, doesn't this linear layer mess up…
erre4
- 95
- 7
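A quick numerical check of the point that usually settles this question, with made-up sizes: the output projection W^{o} in multi-head attention is applied position by position (it only mixes the concatenated head features), so in the decoder it cannot leak information across time steps or undo the causal masking.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 16, 5

w_o = nn.Linear(d_model, d_model, bias=False)    # stands in for W^{o}
attn_out = torch.randn(1, seq_len, d_model)      # concatenated head outputs

# Applying W^{o} to the full sequence...
full = w_o(attn_out)
# ...gives exactly the same result at position 2 as applying it to
# position 2 alone: the projection never looks at other positions.
single = w_o(attn_out[:, 2:3, :])
print(torch.allclose(full[:, 2:3, :], single))   # True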
0
votes
0 answers
Understanding the Training routine of the Transformer architecture
I have been thinking about the masking in the self-attention of the decoder in the context of training for a long time, and it doesn't really make sense to me. I have browsed through a lot of sources and they didn't help.
Given a target and source…
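A compact sketch of what the masked self-attention looks like during training, with toy shapes: the entire target sequence is processed in one forward pass, and the upper-triangular mask forces row i of the attention weights to put zero weight on columns > i, so all next-token predictions are learned in parallel without any position seeing its own future.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 4, 8

# Toy self-attention over decoder states during training.
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

scores = q @ k.T / d**0.5
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))

# Row i of the attention matrix only has weight on columns 0..i, so the
# prediction made at position i never uses the "future" target tokens,
# even though the whole target sequence was fed in at once.
attn = F.softmax(scores, dim=-1)
print(attn)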