
I sent the following question to both ChatGPT and DeepSeek.

"Let's say we're not training the large language model, we are inferencing. The model already generated a sequence [A,B,C] and is about to predict next token D. The model needs to perform self-attention on the current sequence before generating D, if this is correct, since B and C are already generated, even though they are after A, will A attend to B and C during this self-attention?

Please answer with your knowledge, don't make any guess."

I got opposite answers.

ChatGPT: No, during inference (autoregressive decoding), token A will not attend to B and C when predicting token D. This is due to the causal mask used in self-attention.

Here’s why:

- Causal Masking: Transformers use a lower-triangular mask during inference to ensure that each token only attends to itself and past tokens, preventing information leakage from future tokens.
- Autoregressive Decoding: At each step of generation, the model only has access to the tokens generated so far. When predicting D, the model recomputes self-attention for [A, B, C], but each token can only see previous tokens.
- How It Works in This Case: When generating B, self-attention applies only to A. When generating C, self-attention applies to A and B. When generating D, self-attention applies to A, B, and C, but A still cannot attend to B or C; it retains the same original attention distribution it had when first processed.

This constraint ensures the model respects autoregressive decoding and prevents future information from influencing past computations.

Deepseek: Yes, during inference, when the model has already generated the sequence [A, B, C] and is about to predict the next token D, it will perform self-attention over the entire sequence [A, B, C]. This means that A will attend to B and C, and similarly, B and C will attend to A and each other.

In self-attention, each token in the sequence attends to all other tokens in the sequence, including those that come after it. This is because self-attention is bidirectional by design in models like transformers, allowing each token to consider the context provided by all other tokens in the sequence, regardless of their position. This is true even during inference, where the model generates tokens one at a time in an autoregressive manner.

So, when predicting D, the model computes self-attention over [A, B, C], and A will indeed attend to B and C, as well as to itself. This allows the model to use the full context of the sequence to make the next prediction.

A human answer please? Thanks a lot.

OnCodeDeny
1 Answer


You probably got two seemingly contradictory answers because the question is a bit ambiguous. That said, ChatGPT is more correct here than DeepSeek.

The question implies that the model is, or contains, a transformer decoder, since it refers to generating the next word. But there are three big families of transformers: encoder-decoder (from the original paper "Attention Is All You Need"), encoder-only (like BERT) and decoder-only (like GPT). Two of these have a decoder, but their architectures are not exactly the same.

If the question is about a decoder-only transformer, ChatGPT is correct: decoder-only transformers have one self-attention layer per block, which is causal, i.e. it uses a lower-triangular mask so that the model doesn't peek at future tokens.
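To make that concrete, here is a minimal PyTorch sketch of causal self-attention over a 3-token sequence [A, B, C]; the hidden size, random weights and tensor names are made up for illustration, not taken from any particular model. The attention row for A ends up putting all of its weight on A itself, while the row for C attends to A, B and C:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 3, 8                      # 3 tokens [A, B, C]; illustrative hidden size
x = torch.randn(seq_len, d)            # stand-in hidden states for A, B, C

# Illustrative projection matrices (a real model has learned W_q, W_k, W_v)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d ** 0.5            # (3, 3) raw attention scores

# Causal (lower-triangular) mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights)
# Row 0 (token A) is [1.0, 0.0, 0.0]: A attends only to itself.
# Row 2 (token C) has nonzero weights on A, B and C.
```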

If the question is about an encoder-decoder transformer, both ChatGPT and DeepSeek are correct. Indeed, in this case there are two attention layers in each decoder block: first a causal self-attention, as in the decoder-only transformer, and then a second (cross-)attention layer that has no mask, because it uses the K and V matrices from the encoder and the Q matrix from the decoder. In translation, for example, a common application of encoder-decoder transformers, each newly generated word can look at all the words of the original sentence, so no mask is needed.
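As a rough illustration of that second (cross-)attention layer, again with made-up shapes and random weights: Q comes from the decoder states, K and V come from the encoder output, and no mask is applied, so every generated token can look at the whole source sentence.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
src_len, tgt_len = 5, 3                  # e.g. 5 source words, 3 tokens generated so far

enc_out = torch.randn(src_len, d)        # encoder output for the source sentence
dec_states = torch.randn(tgt_len, d)     # decoder states after causal self-attention

# Illustrative projections; in a real model these are learned per layer
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q = dec_states @ W_q                     # queries come from the decoder
K, V = enc_out @ W_k, enc_out @ W_v      # keys and values come from the encoder

scores = Q @ K.T / d ** 0.5              # (tgt_len, src_len), no mask applied
weights = F.softmax(scores, dim=-1)
context = weights @ V                    # every target position sees all source words
print(weights.shape, context.shape)      # torch.Size([3, 5]) torch.Size([3, 8])
```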

Since DeepSeek talks about both bidirectional attention and an autoregressive model, it probably assumed that the question is about an encoder-decoder transformer. The self-attention layer in encoder-only models is also bidirectional by nature (to get the richest embedding of a word in a sentence, it needs the words both before and after it), but it is a decoder that predicts the next token (the autoregressive behaviour).
That said, DeepSeek claims that self-attention in transformers is bidirectional by design (i.e. no triangular mask is applied), which is not correct, as explained above.
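For contrast, a bidirectional (encoder-style, BERT-like) attention sketch is simply the same computation as in the first snippet without the mask; with these made-up tensors, the row for A then gets nonzero weight on B and C as well:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 3, 8
x = torch.randn(seq_len, d)              # hidden states for [A, B, C]
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

weights = F.softmax(Q @ K.T / d ** 0.5, dim=-1)   # no causal mask
print(weights[0])                        # row for A: nonzero weights on A, B and C
```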

rehaqds