
In the positional encoding of the transformer, we usually use a sinusoidal encoding rather than a binary encoding, even though a binary encoding could capture the positional information in much the same way as a sinusoidal one (with multiple values of i covering both near and far positions).

  1. I understand that the sinusoidal wrapper is continuous and yields certain benefits. What I do not understand is why we use the particular term that goes inside the sin and cosine wrappers:

$$ pos/10000^{2i/d} $$

Why do we have to use this? Isn't there some other, simpler function that could be used inside sin and cosine and still capture positional differences (both near and far) as i changes?

  2. Why do we have to use sin and cosine wrappers at all, instead of some other continuous function that accurately captures the positional information? I know that using sin and cosine gives certain trigonometric properties, which ensure that one position vector can be represented as a linear transformation of another position vector. But this seems pretty irrelevant, since that property is not used explicitly by the encoder or anywhere in self-attention. I understand that positional information is implicitly taken into account by the encoder, but nowhere is the trigonometric property itself used. It seems unnecessary to me. Am I missing something?
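
(For reference, the linear-transformation property I mean is the angle-addition identity: writing $\omega_i = 1/10000^{2i/d}$ for the frequency of dimension pair $i$, the encoding at position $pos + k$ is a fixed rotation of the encoding at position $pos$:)

$$ \begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix} $$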

1 Answer


First, consider why we need positional encoding in the first place. The attention operation is permutation invariant: computing self-attention on a sequence [a, b, c] yields the same results as computing it on [b, c, a] (with the output order permuted, of course). For natural language processing, we expect the order of words to matter. As a result, we need to give the model some way of distinguishing order so that [a, b, c] is seen as different from [b, c, a].
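
Here is a minimal sketch of that invariance using PyTorch's nn.MultiheadAttention, with no positional information added:

import torch

attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2)
attn.eval()  # make sure dropout is off so the comparison is exact

x = torch.randn(3, 1, 16)        # sequence [a, b, c], batch size 1, seq-first layout
perm = torch.tensor([1, 2, 0])   # the same tokens reordered to [b, c, a]

out, _ = attn(x, x, x)
out_perm, _ = attn(x[perm], x[perm], x[perm])

# identical outputs once the permutation is undone: order carries no information
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True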

To your question of why we use sin/cos instead of some other function: it really doesn't matter all that much. There are some nice properties of sin/cos that I'll explain later on, but at a high level you just need some way to inject position into the input embeddings. You can use sin/cos, linear biases (i.e. ALiBi), RoPE, polynomial functions, learned positional embeddings, etc. They all accomplish the same goal.
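
As a minimal sketch of the learned variant, it is just a trainable lookup table added to the token embeddings (the sizes below are purely illustrative):

import torch

vocab_size, max_len, d_model = 1000, 512, 64
tok_emb = torch.nn.Embedding(vocab_size, d_model)
pos_emb = torch.nn.Embedding(max_len, d_model)          # one trainable vector per position

tokens = torch.randint(0, vocab_size, (1, 10))          # (batch, seq)
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # (1, seq)
x = tok_emb(tokens) + pos_emb(positions)                # position is now baked into the input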

Now let's look at sin/cos in more detail. The full equation is:

$$ PE_{(pos, 2i)} = \sin(pos/10000^{2i/d}) \\ PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d}) $$

Or in PyTorch:

import math
import torch

max_len = 10000   # number of positions to precompute
d_model = 1024    # embedding dimension

position = torch.arange(max_len).unsqueeze(1)   # (max_len, 1)
# 10000.0 is the base from the formula above; it is unrelated to max_len
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, 1, d_model)
pe[:, 0, 0::2] = torch.sin(position * div_term)   # even dimensions
pe[:, 0, 1::2] = torch.cos(position * div_term)   # odd dimensions

Note that the positional encoding is effectively a 2D table (the extra singleton dimension in the code is just a batch dimension). Each position index is given a position-encoding vector of length d_model, and each value in that vector comes from a sine/cosine with a different frequency. The i value in the equation refers to the dimension within the positional embedding, not to the position itself.
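
As a sketch of how that table is typically consumed (assuming the pe and d_model from the snippet above, with the same seq-first layout):

seq_len, batch = 50, 4
x = torch.randn(seq_len, batch, d_model)   # stand-in for the token embeddings
x = x + pe[:seq_len]                       # broadcasts over the batch dimension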

The position vector therefore contains a mixture of values with different frequencies. The idea is that the low-frequency values capture global position differences, while the high-frequency values capture local position differences.
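
Concretely, the sinusoid used for dimension pair i has wavelength 2π · 10000^(2i/d), so it ranges from a few positions per cycle in the first dimensions up to tens of thousands in the last (a quick sketch using the same d_model as above):

import math

d_model = 1024
for i in [0, 128, 256, 511]:
    print(i, 2 * math.pi * 10000 ** (2 * i / d_model))
# roughly: 0 -> 6.3, 128 -> 63, 256 -> 628, 511 -> 61700 positions per cycle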

You mention using a binary encoding. Imagine something like this:

0: 0, 0, 0
1: 0, 0, 1
2: 0, 1, 0
3: 0, 1, 1
4: 1, 0, 0
5: 1, 0, 1
6: 1, 1, 0
7: 1, 1, 1

Note how the first column oscillates slowly, while the final column oscillates quickly. This is essentially what the sin/cos positional encoding is doing, just with floats rather than ints.
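
You can see the same pattern in the actual sin/cos values; here is a toy check with an illustrative d_model of 8:

import math
import torch

d_model = 8
position = torch.arange(8).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(8, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

print(pe[:, 0])    # oscillates quickly, like the last binary column
print(pe[:, -2])   # drifts slowly, like the first binary column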

Using sin/cos has the benefit of producing position-encoding values that are bounded, smooth, and able to extrapolate beyond the maximum sequence length if needed. These properties make the sin/cos position embeddings work nicely for the task, but they are not theoretically necessary.
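
For example, nothing stops you from evaluating the formula at a position far larger than anything precomputed (sinusoidal_pe below is just a hypothetical helper, not part of any library):

import math
import torch

def sinusoidal_pe(pos, d_model=1024, base=10000.0):
    # encoding for a single, possibly never-seen, position
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos * torch.exp(i * (-math.log(base) / d_model))
    pe = torch.zeros(d_model)
    pe[0::2] = torch.sin(angles)
    pe[1::2] = torch.cos(angles)
    return pe

print(sinusoidal_pe(250_000).abs().max())  # still bounded by 1, no table lookup required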

Karl