Questions tagged [masking]

9 questions
9 votes, 2 answers

Cross-attention mask in Transformers

I can't fully understand how we should create the decoder's cross-attention mask in the original Transformer model from Attention Is All You Need. Here is my attempt at finding a solution: suppose we are training such a Transformer model,…
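
A minimal sketch of the usual answer, assuming the only thing the decoder's cross-attention needs to hide is encoder-side padding (there is no causal structure across the source); tensor names and shapes are illustrative, not taken from the question.

    import torch

    # Hypothetical batch of 2 source sequences, padded with id 0 to length 5.
    src_pad_id = 0
    src_tokens = torch.tensor([[7, 3, 9, 0, 0],
                               [4, 8, 2, 6, 1]])
    tgt_len = 4

    # Every decoder position may look at every real encoder position,
    # so the cross-attention mask only marks padded source tokens.
    keep = (src_tokens != src_pad_id)            # (batch, src_len)
    cross_mask = keep[:, None, None, :].expand(-1, 1, tgt_len, -1)
    print(cross_mask.shape)                      # torch.Size([2, 1, 4, 5])
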
7 votes, 1 answer

Masking during transformer inference?

I sent the following question to both ChatGPT and DeepSeek: "Let's say we're not training the large language model, we are inferencing. The model has already generated a sequence [A,B,C] and is about to predict the next token D. The model needs to perform…
OnCodeDeny
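
One way to sketch the usual explanation, assuming greedy decoding with a key/value cache: at this step the query is only the newest position, so there is nothing "in the future" left to mask. All tensors below are random stand-ins.

    import torch
    import torch.nn.functional as F

    d = 8
    k_cache = torch.randn(3, d)      # cached keys for the prefix [A, B, C]
    v_cache = torch.randn(3, d)      # cached values for the prefix [A, B, C]

    # Only the latest position queries the cache, so its attention row can
    # already see nothing beyond the prefix and no causal mask is needed.
    q_new = torch.randn(1, d)
    scores = q_new @ k_cache.T / d ** 0.5          # (1, 3)
    context = F.softmax(scores, dim=-1) @ v_cache  # (1, d) context used to predict D
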
2 votes, 1 answer

Decoder Transformer feedforward

I have a question about the decoder transformer feed-forward during training. Let's pick an example: the input data is "i love the sun", and the translation I want to predict (the Italian translation) is "io amo il sole". Now I feed the encoder with the input "i love the…
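
A sketch of the training-time pass the question is about, assuming teacher forcing: the whole shifted target is fed to the decoder at once, and a causal mask keeps each position from attending to later ones. The token ids below are made up.

    import torch

    bos, eos = 1, 2
    dec_input = torch.tensor([[bos, 11, 12, 13, 14]])    # "<bos> io amo il sole"
    labels    = torch.tensor([[11, 12, 13, 14, eos]])    # "io amo il sole <eos>"

    # Boolean causal mask in PyTorch's convention (True = not allowed to attend):
    # row i may only look at columns 0..i, so all positions train in parallel.
    L = dec_input.size(1)
    causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    print(causal_mask)
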
2 votes, 1 answer

Why shouldn't we mask [CLS] and [SEP] in preparing inputs for a MLM?

I know that an MLM is trained to predict the index of the [MASK] token in the vocabulary list, and I also know that [CLS] stands for the beginning of the sentence while [SEP] tells the model that the sentence has ended or that another sentence will come soon, but…
Jie
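
A sketch of how the masking positions are usually sampled, assuming BERT-style ids (101 = [CLS], 102 = [SEP]) and a 15% rate: the special tokens are simply excluded from the candidate set, because they are structural markers present in every input rather than content to be predicted.

    import torch

    input_ids = torch.tensor([101, 2023, 2003, 1037, 7953, 102])  # hypothetical sentence
    special_ids = {101, 102}                                      # [CLS], [SEP]

    # Only ordinary word pieces may be replaced by [MASK].
    candidates = torch.tensor([t.item() not in special_ids for t in input_ids])
    probs = torch.full(input_ids.shape, 0.15)
    probs[~candidates] = 0.0
    mlm_positions = torch.bernoulli(probs).bool()   # True where [MASK] goes
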
1 vote, 1 answer

Dealing with high-frequency tokens during masked language modelling?

Suppose I am working with a masked language model to pre-train on a specific dataset. In that dataset, most sequences contain a particular token at high frequency. Sample sequence: , , , , , ---> here tok4 is very…
neel g
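
One commonly suggested workaround, sketched under the assumption that it is acceptable to down-weight the over-represented token in the masking distribution (an inverse-frequency analogue of word2vec-style subsampling); the ids and the 15% base rate are made up.

    import torch

    input_ids = torch.tensor([4, 4, 7, 4, 9, 4, 4, 3])   # id 4 plays the role of tok4

    # Scale the usual 15% masking rate by each token's inverse frequency in
    # the sequence, renormalised so the average rate stays near 15%.
    base = 0.15
    counts = torch.bincount(input_ids).float()
    inv_freq = 1.0 / counts[input_ids]
    probs = (base * inv_freq / inv_freq.mean()).clamp(max=1.0)
    masked = torch.bernoulli(probs).bool()   # tok4 positions are masked far less often
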
1 vote, 1 answer

Anonymize continuous variable for masking purposes

I am about to kick off a large hackathon event. We have a dataset composed of one continuous variable with high precision and a number of categorical variables qualifying these data three levels deep. The data provider wants to 'mask' the data…
HEITZ
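
Two standard options can be sketched as below, assuming the goal is to preserve relative structure while hiding the real magnitudes: a rank transform, or a secret affine rescale with a little noise. The column is simulated; nothing here comes from the actual dataset.

    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.normal(loc=250_000, scale=40_000, size=1_000)   # stand-in column

    # Option 1: rank transform -- keeps ordering and quantiles, destroys scale.
    ranks = values.argsort().argsort() / (len(values) - 1)

    # Option 2: secret affine rescale plus small noise -- roughly preserves
    # correlations with the categorical variables while hiding the true units;
    # a and b stay with the data provider.
    a, b = rng.uniform(0.5, 2.0), rng.uniform(-1e5, 1e5)
    masked = a * values + b + rng.normal(scale=0.01 * values.std(), size=len(values))
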
1 vote, 0 answers

Pytorch Transformer only generating NaN when using mask

When I generate a src_mask like this:

    mask = torch.triu(
        torch.ones(batch_size, batch_size).bool(), diagonal=0
    )
    >> tensor([[ True,  True,  True,  True,  True],
               [False,  True,  True,  True,  True],
               [False, False, …
kot
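
A sketch of the usual diagnosis, assuming the mask is meant to be causal and is passed to nn.TransformerEncoder: it must be (seq_len, seq_len), not (batch_size, batch_size), and in PyTorch's boolean convention True means "blocked", so triu with diagonal=0 blocks the diagonal itself, leaves the first row with nothing to attend to, and a softmax over an all -inf row produces the NaNs. diagonal=1 avoids that.

    import torch
    import torch.nn as nn

    seq_len, batch_size, d_model = 5, 2, 16

    # True strictly above the diagonal = blocked; every row keeps at least
    # one attendable key, so softmax never sees an all -inf row.
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4), num_layers=1)
    x = torch.randn(seq_len, batch_size, d_model)   # default layout: (seq, batch, d_model)
    out = encoder(x, mask=causal_mask)
    print(torch.isnan(out).any())                   # tensor(False)
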
0 votes, 1 answer

Could there be a problem with the linear layer after the attention inside a transformer?

My question regards this image: it seems that after the multi-head attention there is a linear layer, as they also mention here: the linearity is given by the weights W^{o}. My question is: for the decoder, doesn't this linear layer mess up…
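
A sketch of why the answer is usually "no", assuming the standard multi-head formulation: the output projection W^{o} is applied independently at every position, mixing the concatenated heads but never mixing positions, so it cannot undo the causal masking done inside attention. Shapes are illustrative.

    import torch
    import torch.nn as nn

    seq_len, n_heads, d_head = 4, 2, 3
    d_model = n_heads * d_head

    heads = torch.randn(seq_len, n_heads, d_head)   # per-position head outputs
    concat = heads.reshape(seq_len, d_model)        # concatenation of the heads

    w_o = nn.Linear(d_model, d_model, bias=False)   # this is W^{o}
    out = w_o(concat)
    # out[i] depends only on concat[i], i.e. only on position i's own heads,
    # so no information flows between positions in this layer.
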
0 votes, 0 answers

Understanding the Training routine of the Transformer architecture

I have been thinking about the masking in the self-attention of the decoder in the context of training for a long time, and it doesn't really make sense to me. I have browsed through a lot of sources and they didn't help. Given a target and source…
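
The point most explanations hinge on can be sketched with a single masked self-attention layer (random weights, no claim about the question's exact setup): with a causal mask, one parallel pass over the whole target gives exactly the same per-position outputs as feeding growing prefixes one step at a time, which is why training is parallel while generation is sequential.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    L, d = 5, 8
    x = torch.randn(L, d)                                   # embedded target (teacher forcing)
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))

    def attend(seq):
        # Masked self-attention over `seq`: True above the diagonal = blocked.
        q, k, v = seq @ wq, seq @ wk, seq @ wv
        scores = q @ k.T / d ** 0.5
        mask = torch.triu(torch.ones(len(seq), len(seq), dtype=torch.bool), diagonal=1)
        return F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

    full = attend(x)                                        # one parallel training pass
    stepwise = torch.stack([attend(x[: i + 1])[-1] for i in range(L)])
    print(torch.allclose(full, stepwise, atol=1e-6))        # True
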