In the Transformer's positional encoding, we usually use a sinusoidal encoding rather than a binary one, even though a binary encoding could capture positional information in much the same way as a sinusoidal encoding (with multiple values of i capturing how close positions are).
- I understand that the sinusoidal wrapper is continuous and that this yields certain benefits. What I do not understand is why we use this particular term inside the sine and cosine wrappers:
pos/10000^(2i/d)
Why do we have to use this? Isn't there a simpler function that could be used inside the sine and cosine and still show positional differences (both near and far) as i changes?
- Why do we have to use sine and cosine wrappers at all, instead of some other continuous function that accurately captures the positional information? I know that sine and cosine have trigonometric properties which ensure that one position vector can be represented as a linear transformation of another position vector. But this seems pretty irrelevant, since the property is not used anywhere by the encoder or in self-attention. I understand that the encoder takes the positional information into account implicitly, but nowhere is the trigonometric property itself used. It seems unnecessary to me. Am I missing something?
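To make the property I am referring to concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from any library: the names `angular_frequencies`, `positional_encoding`, and `shift` are hypothetical, and I am assuming the usual layout where sine and cosine values for the same i sit next to each other and d is even. It checks numerically that the encoding of pos + k can be obtained from the encoding of pos by a linear map that depends only on k.

```python
import math

def angular_frequencies(d_model, base=10000.0):
    """w_i = 1 / base^(2i/d) for i = 0 .. d/2 - 1 (d_model assumed even)."""
    return [1.0 / base ** (2 * i / d_model) for i in range(d_model // 2)]

def positional_encoding(pos, d_model, base=10000.0):
    """Vector [sin(w_0*pos), cos(w_0*pos), sin(w_1*pos), cos(w_1*pos), ...]."""
    vec = []
    for w in angular_frequencies(d_model, base):
        vec += [math.sin(w * pos), math.cos(w * pos)]
    return vec

def shift(vec, k, d_model, base=10000.0):
    """Map PE(pos) to PE(pos + k) without knowing pos.

    Each (sin, cos) pair is rotated by the angle w_i * k, which is a
    linear operation whose coefficients depend only on the offset k.
    """
    out = []
    for j, w in enumerate(angular_frequencies(d_model, base)):
        s, c = vec[2 * j], vec[2 * j + 1]
        cos_k, sin_k = math.cos(w * k), math.sin(w * k)
        out += [s * cos_k + c * sin_k,   # sin(w*(pos + k))
                c * cos_k - s * sin_k]   # cos(w*(pos + k))
    return out

d_model, k = 16, 3
for pos in (0, 7, 100):
    shifted = shift(positional_encoding(pos, d_model), k, d_model)
    target = positional_encoding(pos + k, d_model)
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, target))

# Wavelengths 2*pi / w_i grow geometrically, starting at 2*pi for i = 0
# and approaching 2*pi * base as i approaches d/2.
wavelengths = [2 * math.pi / w for w in angular_frequencies(d_model)]
```

The `wavelengths` list is only there to show what the 10000^(2i/d) term does: small i gives fast-changing dimensions that separate nearby positions, while large i gives slow-changing dimensions that separate distant ones. Whether the model actually exploits the linear-shift property is exactly what I am asking about.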