When I generate a src_mask like this (size 5 just to match the output below):

import torch

size = 5
mask = torch.triu(
    torch.ones(size, size).bool(),
    diagonal=0,
)
>> tensor([[ True,  True,  True,  True,  True],
           [False,  True,  True,  True,  True],
           [False, False,  True,  True,  True],
           [False, False, False,  True,  True],
           [False, False, False, False,  True]])
then the transformer only produces NaN values. If I change it to diagonal=1 it works, but I don't really understand why. The goal of the mask is to prevent the transformer from attending to any sample after the current one (and later I want to increase the number of masked positions so the model has to predict further into the future).
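Here is a minimal sketch of what I believe happens inside the attention step (assuming PyTorch's convention that `True` in a boolean mask means "do not attend", i.e. the score is replaced with `-inf` before the softmax). With `diagonal=0` the diagonal itself is masked, so the first row has no unmasked position left and its softmax is NaN; with `diagonal=1` every row keeps at least the current position:

```python
import torch

scores = torch.zeros(5, 5)  # dummy attention scores

# diagonal=0: the diagonal is True too, so row 0 is masked everywhere
full_mask = torch.triu(torch.ones(5, 5).bool(), diagonal=0)
attn = torch.softmax(scores.masked_fill(full_mask, float("-inf")), dim=-1)
print(attn[0])  # softmax over only -inf values -> all NaN

# diagonal=1: the diagonal stays False, every row keeps >= 1 unmasked entry
shifted_mask = torch.triu(torch.ones(5, 5).bool(), diagonal=1)
attn2 = torch.softmax(scores.masked_fill(shifted_mask, float("-inf")), dim=-1)
print(attn2)  # no NaNs anywhere
```

So my understanding is that the NaNs come from the fully masked first row, and those NaNs then propagate through the rest of the network.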