
I'm trying to read and understand the paper Attention Is All You Need, and in it they use a positional encoding with sine for even indices and cosine for odd indices.

In the paper (Section 3.5), they mention:

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.

My question is: if there is no recurrence, why not use one-hot encoding? What is the advantage of using a sinusoidal positional encoding?

3 Answers


You are mixing two different concepts in the same question:

  • One-hot encoding: an approach to encode $n$ discrete tokens as $n$-dimensional vectors that are all 0's except for a single 1. This can be used to encode the tokens themselves in networks with discrete inputs, but only if $n$ is not very large, since the memory needed grows with $n$. Transformers (and most other NLP neural models) use embeddings, not one-hot encoding. With embeddings, you have a table with $n$ entries, each of them a vector of dimensionality $e$. To represent token $k$, you select the $k$-th entry in the embedding table. The embeddings are trained together with the rest of the network on the task.
  • Positional encoding: in recurrent networks like LSTMs and GRUs, the network processes the input sequentially, token after token, and the hidden state at position $t+1$ depends on the hidden state at position $t$. This gives the network a means to identify the relative position of each token by accumulating information. The Transformer, however, has no built-in notion of token order. Positional encodings solve this: you keep a separate embedding table of vectors, and instead of indexing it by token identity, you index it by the position of the token. This positional table is therefore much smaller than the token embedding table, normally containing a few hundred entries. For each token in the sequence, the input to the first attention layer is computed by adding the token embedding and the positional embedding. Positional embeddings can either be trained with the rest of the network (just like token embeddings) or pre-computed with the sinusoidal formula from Vaswani et al. (2017); pre-computed positional embeddings give fewer trainable parameters with no loss in the resulting quality. A minimal sketch of the pre-computed variant follows this list.
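To make the positional-encoding point concrete, here is a minimal NumPy sketch (the sizes and variable names are my own illustration, not taken from the paper's code) of the sinusoidal table from Vaswani et al. (2017) and of adding it to the token embeddings:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Pre-compute the (max_len, d_model) table from Vaswani et al. (2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    positions = np.arange(max_len)[:, np.newaxis]               # (max_len, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)    # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_term)                  # even indices
    pe[:, 1::2] = np.cos(positions / div_term)                  # odd indices
    return pe

# Illustrative sizes: a vocabulary of 10k tokens, embedding size 512.
vocab_size, d_model, seq_len = 10_000, 512, 8
token_embedding_table = np.random.randn(vocab_size, d_model)   # trained in practice
pe = sinusoidal_positional_encoding(max_len=512, d_model=d_model)

token_ids = np.array([5, 42, 7, 7, 99, 3, 0, 1])               # a toy input sequence
x = token_embedding_table[token_ids] + pe[:seq_len]            # input to the first layer
```

Note that the pe table depends only on position, so the same rows are reused for every sequence; only the token embedding table is indexed by token identity.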

Therefore, there is no advantage of one over the other: they serve orthogonal purposes.

noe

The theoretical advantage should be that the network can grasp the pattern in the encoding and thus generalize better to longer sentences. With one-hot position encoding, you would learn the embeddings of earlier positions much more reliably than those of later positions, simply because later positions occur in far fewer training sentences.
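As a hedged illustration of that argument (the function and sizes below are my own, not from either paper): the sinusoidal encoding is a fixed function defined for any position, while a learned table indexed by position has no trained entry beyond the maximum position it was built for.

```python
import numpy as np

def sinusoidal_pe(pos, d_model=8):
    """Sinusoidal vector for a single position; defined for arbitrarily large pos."""
    i = np.arange(0, d_model, 2)
    angle = pos / 10000 ** (i / d_model)
    out = np.empty(d_model)
    out[0::2] = np.sin(angle)
    out[1::2] = np.cos(angle)
    return out

# Works equally well for positions never seen during training...
print(sinusoidal_pe(10))
print(sinusoidal_pe(10_000))

# ...whereas a learned table of, say, 512 rows simply has no row 10_000,
# and the rows for rare, late positions receive very few gradient updates.
learned_table = np.random.randn(512, 8)   # stand-in for trained position embeddings
# learned_table[10_000]                   # would raise an IndexError
```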

On the other hand, the paper on Convolutional Sequence to Sequence Learning, published shortly before the Transformer, uses learned position embeddings (i.e., one-hot position indices into a trained table), and it does not seem to do any harm there.

Jindřich

With plain one-hot encoding you get inputs in the range 0..1 that are not normally distributed, whereas that is roughly what the first linear layer expects given its uniform(-1/sqrt(in_features), 1/sqrt(in_features)) initialization, so you might get longer training time and worse generalization because of the mismatch between the input distribution and the weight initialization.

On the other hand, why should you use one-hot at all, when you can just add a linear layer without a bias term? That is exactly what an embedding layer is, and it is normal-initialized with mean 0 and std 1. Looking up an entry at some index then gives you normally distributed activations, unlike one-hot inputs. An embedding layer also avoids the redundant dimensionality explosion caused by one-hot encoding. You may want to consider freezing the weights of the embedding layer, as you don't gain much from two stacked linear layers (input x -> embeddings -> linear) without a non-linearity between the embedding layer and the linear layer.
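Here is a minimal PyTorch sketch of that equivalence (my own illustration, not any particular model's code): an embedding lookup computes the same thing as multiplying a one-hot vector by the weight matrix of a bias-free linear layer, and the embedding can be frozen if you don't want to train it.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 16           # illustrative sizes
emb = nn.Embedding(vocab_size, d_model)  # weights drawn from N(0, 1) by default

token_ids = torch.tensor([3, 17, 17, 42])

# An embedding lookup...
looked_up = emb(token_ids)

# ...is the same computation as one-hot times the same weight matrix
# (a linear layer without a bias term).
one_hot = nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
via_one_hot = one_hot @ emb.weight

print(torch.allclose(looked_up, via_one_hot))  # True

# Optionally freeze the embedding, as suggested above:
emb.weight.requires_grad_(False)
```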

Now, what can I say about positional encoding in a feedforward network such as BERT, GPT and BARD? Take a look at how the positional encoding is combined with your inputs:

x = x + pe

By this logic you could just as well pre-generate the pe (positional encoding) tensor from a random U-shaped beta distribution. You cannot pack many features into one feature simply by summing your token embedding with the position embedding and the segment embedding; that is why, even with segment embeddings, BERT still needed a separator token.
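For reference, a rough sketch of what that summation looks like in a BERT-style input layer (the module names, sizes and ids below are illustrative, not BERT's actual code):

```python
import torch
import torch.nn as nn

vocab_size, max_len, n_segments, d_model = 30_000, 512, 2, 768  # roughly BERT-base-sized

tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)      # BERT learns its position embeddings
seg_emb = nn.Embedding(n_segments, d_model)   # segment A / segment B

token_ids   = torch.tensor([[101, 2023, 2003, 102, 2009, 2003, 102, 0]])  # toy ids
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1,   0]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# All three embeddings are simply summed into one vector per token position;
# the sum does not keep the three sources separable, which is the point above.
x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)
```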

Concluding, positional information in a feedforward neural network is carried by the neurons themselves. If you just threw out the positional encoding from the BERT architecture, you would not lose performance.

To support this, there are two important papers I have just discovered.

But where might positional embeddings work and improve performance? In convolutional networks, of course, where you have kernels rather than position-specific neurons. There, positional embeddings can be injected as an extra channel of the input.
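One way to do this in practice, sketched below under my own assumptions (in the spirit of coordinate-channel approaches such as CoordConv, which this answer does not name explicitly): concatenate normalized coordinate maps as extra input channels before the convolution.

```python
import torch
import torch.nn as nn

def add_coord_channels(x):
    """Append normalized (y, x) coordinate maps to an image batch of shape (N, C, H, W)."""
    n, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, ys, xs], dim=1)  # (N, C + 2, H, W)

images = torch.randn(4, 3, 32, 32)
conv = nn.Conv2d(in_channels=3 + 2, out_channels=16, kernel_size=3, padding=1)
out = conv(add_coord_channels(images))    # (4, 16, 32, 32)
```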