
I am implementing a custom algorithm inspired by the NMT architecture, BUT in the decoder, if the query is the target language, then shouldn't the "value" also be the same thing? Only the "key" should be the encoder output (the encoded source language). After much heartburn I have made peace with the fact that a "query" is what you are trying to find out and the "key" is some sort of index FOR the "values", from which you choose your answer (based on the best score generated by the attention algorithm). So if I want to convert French to English, and my encoder is encoding French, then my query has to be in English and so the values must be in English, right? Why does the TF code (NMT tutorial) take the French encoding as both the KEY and the VALUE?

OR is the interpretation that, since the query (masked input, English) and the key (encoded French) are first dot-producted (sorry) together, the "values" are in fact being learnt during training, based on the loss calculated from the difference between the input context (English) so far and the next predicted word? And that during inference the "key-values" then act as a French-English dictionary (a very smart dictionary at that, which gives the nearest word based on context)?

Vikram Murthy

1 Answer


Note three things:

  1. The output of the encoder is not English but just what the decoder needs from the source sentence to generate the translation.

  2. Only the first decoder layer receives the target language token embeddings. The following layers receive the output of the previous layer.

  3. There are matrix multiplications before the dot products, which can project their inputs to completely different representation spaces (see the sketch after this list).
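
To make the third point concrete, here is a minimal sketch of how decoder cross-attention is typically wired. The shapes and variable names are made up for illustration, and this is not the exact code from the TF NMT tutorial: the point is that the queries are projected from the decoder states while the keys *and* values are both projected from the encoder output, each through its own learned matrix.

```python
import tensorflow as tf

# Hypothetical shapes for illustration:
#   enc_out: (batch, src_len, d_model) -- encoder output (the "French" side)
#   dec_in:  (batch, tgt_len, d_model) -- decoder layer input (the "English" side)
d_model = 64
batch, src_len, tgt_len = 2, 7, 5

enc_out = tf.random.normal((batch, src_len, d_model))
dec_in = tf.random.normal((batch, tgt_len, d_model))

# Learned projections: these matrices are trained with the rest of the model,
# so Q, K and V end up in whatever representation spaces minimize the loss.
W_q = tf.keras.layers.Dense(d_model, use_bias=False)
W_k = tf.keras.layers.Dense(d_model, use_bias=False)
W_v = tf.keras.layers.Dense(d_model, use_bias=False)

Q = W_q(dec_in)   # queries come from the decoder states
K = W_k(enc_out)  # keys come from the encoder output...
V = W_v(enc_out)  # ...and so do the values (the part the question asks about)

# Scaled dot-product attention
scores = tf.matmul(Q, K, transpose_b=True) / tf.math.sqrt(tf.cast(d_model, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)  # (batch, tgt_len, src_len)
context = tf.matmul(weights, V)           # (batch, tgt_len, d_model)

print(context.shape)  # (2, 5, 64)
```

Note that because W_k and W_v are separate learned matrices, the keys and values derived from the same encoder output can live in quite different spaces, even though they share the same source tensor.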

So, to answer your question:

Neither of your interpretations is correct. The keys, values and queries are not in an "English representation space" nor in a "French representation space". Keys, values and queries are vectors in representation spaces that have been learned by the network during training. These representation spaces are not necessarily interpretable by a human; they were learned just to lower the loss on the task the model was trained on (i.e. translation).

As an example of what I am trying to convey, consider Transformer models trained for multilingual machine translation. These models can receive many different languages as input and generate translations in many different languages. They learn to represent information in a way that makes it possible to translate properly between those languages (i.e. to minimize the loss they were trained on). The same happens in a non-multilingual machine translation Transformer.

Actually, there are many scientific papers trying to understand what kind of information is encoded at the output of each layer in Transformer models.

noe