
I have been thinking about the masking in the self-attention of the decoder, in the context of training, for a long time, and it doesn't really make sense to me. I have browsed through a lot of sources and they didn't help.

Given a source and target sentence for translation (English to German):

source: [I, am, home]
target: [Ich, bin, daheim]  

The source sentence is fed into the encoder, where we apply self-attention; the encoder output then serves as our value matrix $V$ and our key matrix $K$ in the cross-attention block of the decoder.
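
To pin down what I mean, here is a minimal sketch of that cross-attention step (PyTorch, single head, no learned projections; names like `d_model` and `enc_out` are my own and just stand in for the real components):

```python
import torch
import torch.nn.functional as F

d_model = 8                          # toy model dimension (my own choice)
src_len, tgt_len = 3, 3              # "I am home" / ["<eos>", "Ich", "bin"]

enc_out = torch.randn(src_len, d_model)      # stands in for the encoder output
dec_hidden = torch.randn(tgt_len, d_model)   # decoder states after masked self-attention

# Cross-attention: queries come from the decoder, keys/values from the encoder output
Q = dec_hidden       # in a real model, Q, K, V would each go through a learned projection
K = enc_out
V = enc_out

scores = Q @ K.T / d_model ** 0.5            # (tgt_len, src_len)
attn = F.softmax(scores, dim=-1)
cross_out = attn @ V                         # (tgt_len, d_model)
```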

For the decoder input we shift the entire target sentence to the right and prepend an $<eos>$ token: [$<eos>$, Ich, bin]
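
In other words (just to fix my terminology; the strings here are of course made up stand-ins for token ids):

```python
target = ["Ich", "bin", "daheim"]
decoder_input = ["<eos>"] + target[:-1]   # ["<eos>", "Ich", "bin"]
expected_output = target                  # ["Ich", "bin", "daheim"]
```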

Now comes the part that I have trouble with: when we train, we do a single forward pass of this training example. According to my understanding, masking allows us to train with all prefixes of our training example at once: [$<eos>$, Ich], [$<eos>$].
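
The way I picture this mask is as a lower-triangular matrix over the decoder positions, roughly like this toy sketch (single head, no learned projections, shapes chosen by me):

```python
import torch
import torch.nn.functional as F

tgt_len, d_model = 3, 8
dec_in = torch.randn(tgt_len, d_model)        # embeddings of ["<eos>", "Ich", "bin"]

scores = dec_in @ dec_in.T / d_model ** 0.5                  # (tgt_len, tgt_len)
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))     # position i cannot see positions > i
X = F.softmax(scores, dim=-1) @ dec_in                       # row i only mixes positions 0..i
```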

The output of the decoder's masked self-attention will be a matrix $X$, where row $i$ represents word $i$ paying attention to itself and all previous words, i.e. the last word of the prefix ending at position $i$. So do we just use one specific row to make our next prediction, or do we use that row and every row above it? To make this more clear:

Example: $x_{2}$ represents the result of the attention process where "Ich" pays attention to itself and $<eos>$, i.e. the prefix [$<eos>$, Ich]. Do we then also need $x_1$ to predict the next word "bin", or how does this work? I am just very confused about the precise training routine and would be very glad about a clear description.
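
My current guess is that every row of $X$ is pushed through the output projection and the loss is computed against all target positions in parallel, something like the sketch below (hypothetical names and toy sizes; this guess is exactly what I would like confirmed or corrected):

```python
import torch
import torch.nn as nn

tgt_len, d_model, vocab_size = 3, 8, 1000   # toy sizes, my own choice
X = torch.randn(tgt_len, d_model)           # decoder output, one row per input position
target_ids = torch.tensor([5, 42, 7])       # made-up ids for ["Ich", "bin", "daheim"]

out_proj = nn.Linear(d_model, vocab_size)   # final projection to vocabulary logits
logits = out_proj(X)                        # (tgt_len, vocab_size): row i predicts target token i
loss = nn.functional.cross_entropy(logits, target_ids)  # all positions contribute in one pass
```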

