I was reading the Swin Transformer paper and looking at the GitHub implementation, and I noticed that when calculating the relative position bias, the input to the log function before the CPB MLP is scaled to the range 0 to 8. I couldn't find any mention of this in the original paper. My intuition is that this gives an output in the range 0 to 1. Is that the correct reasoning?
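To make my reading concrete, here is a minimal sketch of what I think the scaling does. It assumes the transform is sign(x) * log2(|x| + 1) / log2(8) applied to coordinates already scaled to 8; the function name `log_spaced` is my own, and the sign handling for negative offsets is an assumption on my part rather than something I have confirmed.

```python
import math

def log_spaced(coord, scale=8.0):
    """Hypothetical helper: sign(x) * log2(|x| + 1) / log2(scale).

    `coord` is assumed to be a relative coordinate already scaled
    so that its magnitude is at most `scale`.
    """
    sign = (coord > 0) - (coord < 0)  # -1, 0, or 1
    return sign * math.log2(abs(coord) + 1.0) / math.log2(scale)

# Print a few values across the assumed input range.
for x in (0.0, 1.0, 4.0, 7.0, 8.0):
    print(x, log_spaced(x))
```

In this sketch an input of 7 maps to exactly 1, and an input of 8 maps to slightly above 1, so the output range is only approximately 0 to 1.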
I also noticed that the output of the MLP is passed through a sigmoid function and then scaled by a factor of 16. I couldn't find this mentioned in the paper either. What is the underlying reasoning?
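As a sketch of the step I mean, assuming the bias is computed as 16 * sigmoid(MLP output), the snippet below shows the effect on the range. The name `bounded_bias` and the scalar inputs are hypothetical; in the real model the MLP output would be a tensor.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function for scalar inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bounded_bias(mlp_output, factor=16.0):
    # Hypothetical helper: squashes an unbounded MLP output
    # into the open interval (0, factor).
    return factor * sigmoid(mlp_output)

# Print the bounded bias for a few raw MLP outputs.
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, bounded_bias(raw))
```

If this matches the code, the bias always lies strictly between 0 and 16, whatever the MLP produces.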
I have also just noticed that the logit scale parameter is initialised to ln(10). Is there a reason for this?
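For reference, this is how I understand the parameter to be used. The sketch assumes the logit scale is exponentiated and multiplies the cosine similarity between a query and a key; the function `scaled_cosine` and the plain-list vectors are my own illustration, not the repository's code.

```python
import math

# Initial value as described above: ln(10), so exp(logit_scale) is about 10.
logit_scale = math.log(10.0)

def scaled_cosine(q, k, logit_scale=logit_scale):
    # Hypothetical helper: cosine similarity of q and k,
    # multiplied by exp(logit_scale).
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return math.exp(logit_scale) * dot / (norm_q * norm_k)

print(scaled_cosine([1.0, 0.0], [2.0, 0.0]))   # parallel vectors
print(scaled_cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

Under that assumption, the initial attention logits would lie between about -10 and 10 before any bias is added.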
Thank you for any assistance.