I am currently trying to replicate, or at least come close to, the results achieved in the MultiSpider paper, a multilingual Text-to-SQL benchmark. I have downloaded the model the authors used to produce their results. It takes 9-10 minutes to evaluate 1034 samples and reaches an English accuracy of 69% and a German accuracy of 65%, which is in line with the paper's reported results.

They provide different configs for all the languages, which can be found here: https://github.com/longxudou/multispider/tree/main/configs/duorat

The model is an attention-based Text-to-SQL model called RAT-SQL.

I have trained a model myself over the last 4 days, for 100k steps. The loss dropped below 0.0001 (i.e. it rounds to four zeros after the decimal point), which leads me to believe the model learned something. But on the same eval as above it takes 1.5 h and yields 0% accuracy. Looking at the predictions, some examples have no answer at all.
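To quantify that last point, here is a minimal sketch for counting the empty predictions. It assumes the eval writes one predicted SQL query per line to a plain-text file; the file name predictions.txt is a placeholder, and the actual duorat output format may differ:

```python
# Count empty predictions in an eval output file.
# "predictions.txt" is a placeholder name; the actual duorat eval
# output format and location may differ.
from pathlib import Path

preds = Path("predictions.txt").read_text().splitlines()
empty = [i for i, p in enumerate(preds) if not p.strip()]
print(f"{len(empty)}/{len(preds)} predictions are empty")
print("first empty indices:", empty[:10])
```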

These are the things that are notably different between my training process and theirs:

  1. I had to disable caching via @lru_cache(maxsize=None) (see the sketch after this list)
  2. I changed the training batch size from 9 to 8
  3. The eval batch size for my German training run was 256, while the original multilingual setup uses 64
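For point 1, here is a minimal sketch of one way such caching can be disabled globally: replace functools.lru_cache with a pass-through decorator before the cached modules are imported. My actual change in the duorat code was more ad hoc, but the effect (no caching) is the same:

```python
import functools

def _no_cache(maxsize=None, typed=False):
    """Pass-through replacement for functools.lru_cache.

    Returns the decorated function unwrapped, so nothing is cached.
    """
    if callable(maxsize):
        # Handles the bare @lru_cache form (no parentheses).
        return maxsize

    def decorator(fn):
        return fn

    return decorator

# Must run before any module applies @lru_cache(maxsize=None).
functools.lru_cache = _no_cache
```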

Most importantly, for inference you need to provide the same config that was used to train the model. I have adapted the German config to the point where I could evaluate my own German model. But using that German config with the authors' model yields a PyTorch error about a dimension mismatch, which did not happen with my model, so the two models must differ fundamentally in size.
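To pin down where that size difference comes from, this sketch diffs the parameter shapes of the two checkpoints. The file names are placeholders, and the checkpoints may nest their weights under a key other than "model":

```python
import torch

def get_state_dict(ckpt):
    # Checkpoints often nest the weights, e.g. under a "model" key;
    # fall back to the top-level object otherwise.
    return ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

mine = get_state_dict(torch.load("my_model.pt", map_location="cpu"))
theirs = get_state_dict(torch.load("authors_model.pt", map_location="cpu"))

# Print every parameter whose shape differs, or that exists in only one model.
for name in sorted(set(mine) | set(theirs)):
    a = tuple(mine[name].shape) if name in mine else None
    b = tuple(theirs[name].shape) if name in theirs else None
    if a != b:
        print(f"{name}: mine={a} theirs={b}")
```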

Currently I am training a 10k-step model with a German config that matches theirs on the points above, except for the caching and the training batch size, to see if this makes a difference. After that I will try the same for the multilingual training.

Uwe
