I sent the following question to both ChatGPT and Deepseek.
"Let's say we're not training the large language model, we are inferencing. The model already generated a sequence [A,B,C] and is about to predict next token D. The model needs to perform self-attention on the current sequence before generating D, if this is correct, since B and C are already generated, even though they are after A, will A attend to B and C during this self-attention?
Please answer with your knowledge, don't make any guess."
I got opposite answers.
ChatGPT: No, during inference (autoregressive decoding), token A will not attend to B and C when predicting token D. This is due to the causal mask used in self-attention.
Here’s why:
Causal Masking: Transformers use a lower-triangular mask during inference to ensure that each token only attends to itself and past tokens, preventing information leakage from future tokens.

Autoregressive Decoding: At each step of generation, the model only has access to tokens generated so far. When predicting D, the model recomputes self-attention for [A, B, C], but each token can only see previous tokens.

How It Works in This Case: When generating B, self-attention applies only to A. When generating C, self-attention applies to A and B. When generating D, self-attention applies to A, B, and C, but A still cannot attend to B or C—it retains the same original attention distribution it had when first processed.

This constraint ensures the model respects autoregressive decoding and prevents future information from influencing past computations.
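In case it helps frame the question, here is a minimal NumPy sketch of the causal-masked self-attention ChatGPT is describing. Everything here is a toy illustration: a single head, random embeddings standing in for [A, B, C], and the inputs used directly as queries/keys/values instead of learned W_q/W_k/W_v projections, so none of the names or numbers come from a real model.

```python
import numpy as np

def causal_self_attention(x):
    """Toy single-head scaled dot-product attention with a causal
    (lower-triangular) mask. x: (seq_len, d_model) embeddings for [A, B, C]."""
    seq_len, d = x.shape
    # Toy shortcut: use the embeddings directly as queries/keys/values
    # (a real model applies learned W_q, W_k, W_v projections first).
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    mask = np.tril(np.ones((seq_len, seq_len)))    # 1 = may attend, 0 = masked
    scores = np.where(mask == 1, scores, -np.inf)  # block attention to later tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))   # random embeddings standing in for A, B, C
weights, _ = causal_self_attention(x)
print(np.round(weights, 3))
# Row 0 (token A) comes out as [1.0, 0.0, 0.0]: with the mask in place,
# A attends only to itself, never to B or C.
```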
Deepseek: Yes, during inference, when the model has already generated the sequence [A, B, C] and is about to predict the next token D, it will perform self-attention over the entire sequence [A, B, C]. This means that A will attend to B and C, and similarly, B and C will attend to A and each other.
In self-attention, each token in the sequence attends to all other tokens in the sequence, including those that come after it. This is because self-attention is bidirectional by design in models like transformers, allowing each token to consider the context provided by all other tokens in the sequence, regardless of their position. This is true even during inference, where the model generates tokens one at a time in an autoregressive manner.
So, when predicting D, the model computes self-attention over [A, B, C], and A will indeed attend to B and C, as well as to itself. This allows the model to use the full context of the sequence to make the next prediction.
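For contrast, Deepseek's description sounds to me like attention with no mask at all, which is how I understand encoder models such as BERT to work. With the same toy setup as above (again, purely illustrative, nothing from a real model), dropping the mask does give A nonzero weight on B and C:

```python
import numpy as np

def unmasked_self_attention(x):
    """Same toy single-head attention as the previous sketch, but with no
    causal mask, i.e. the fully bidirectional behaviour Deepseek describes."""
    seq_len, d = x.shape
    q, k = x, x                    # toy shortcut: no learned projections
    scores = q @ k.T / np.sqrt(d)  # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))        # random embeddings standing in for A, B, C
print(np.round(unmasked_self_attention(x), 3))
# Row 0 (token A) now spreads its weight across A, B, and C.
```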
Can a human weigh in, please? Thanks a lot.