I'm not sure you even need to finetune here. I use LLMs a lot for information extraction from noisy data and find they work quite well for the most part. There are obviously times they fail, but no amount of prompt engineering is going to get past that.
Depending on what your data looks like, you could always instruct the LLM to extract information in a tabular way and have it output results in some JSON format. For example, you could have a prompt like this:
You are an AI agent in the industrial maintenance field.
Your goal is to extract information from sentences related to: purchasing, <thing2>, <thing3>, etc.
The output should be in a structured JSON format as follows:
{
"purchase": true/false/unknown,
"<thing2>": ...,
...
}
Because of how LLMs are trained, they shouldn't have issues with this kind of simple structured output. And this way you can build up a JSON record with that information for each data point.
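To make that concrete, here's a minimal sketch of the parsing side, assuming you've already gotten a raw text reply back from whatever LLM you're calling (the prompt text and field names are just placeholders from the example above, not a real API):

```python
import json

# Hypothetical system prompt following the structure sketched above.
SYSTEM_PROMPT = """You are an AI agent in the industrial maintenance field.
Your goal is to extract information from sentences related to: purchasing, etc.
The output should be in a structured JSON format as follows:
{"purchase": true/false/"unknown"}
Respond with JSON only."""

def parse_extraction(raw_reply: str) -> dict:
    """Parse the model's JSON reply, falling back to 'unknown' if the
    model produced something that isn't valid JSON."""
    try:
        return json.loads(raw_reply)
    except json.JSONDecodeError:
        return {"purchase": "unknown"}

# Illustrative replies a model might return (not real API calls):
print(parse_extraction('{"purchase": true}'))   # {'purchase': True}
print(parse_extraction("Sure! Here you go..."))  # {'purchase': 'unknown'}
```

The fallback matters in practice: even with "respond with JSON only" in the prompt, models occasionally wrap the JSON in prose, so you want a defined behavior for unparseable replies rather than a crash.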
Since you mentioned sentence transformers, you could also use a BERT-style model to perform feature probing. This requires a bit more engineering, and you'd want to tune similarity thresholds for the comparison. One way to do this is to take your sentences <sent1>, <sent2>, ... and the features you'd like to probe for, <feature1>, <feature2>, .... Then, for each sentence, you compute its embedding. This could be through the [CLS] token, a sentence transformer, average pooling, etc. This gets you the embeddings <emb1>, <emb2>, ... representing the sentences. Now you do something similar for the features to get <f1>, <f2>, ...
Let's say <feature1> is something simple like purchase: just a single word to test whether or not a sentence indicates a purchase was made. You could compare how similar the embeddings <emb1>, <emb2>, ... are to that feature vector. If a sentence's similarity is above some threshold, you classify it as positively containing a purchase; if it's below the threshold, no purchase was made.
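The probing step itself is just a cosine-similarity comparison against a threshold. A minimal sketch, using hand-made placeholder vectors standing in for <emb1>, <emb2>, and <f1> (in practice these would come from your BERT-style encoder, and the threshold would be tuned on held-out data):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def probe(sent_emb: np.ndarray, feature_emb: np.ndarray,
          threshold: float = 0.5) -> bool:
    """Classify a sentence as positive for a feature if its embedding
    is similar enough to the feature's embedding."""
    return cosine(sent_emb, feature_emb) >= threshold

# Placeholder embeddings (illustrative only, not real model outputs):
f_purchase = np.array([1.0, 0.0, 0.0])  # feature vector for "purchase"
emb1 = np.array([0.9, 0.1, 0.0])        # sentence that mentions a purchase
emb2 = np.array([0.0, 0.2, 0.9])        # unrelated sentence

print(probe(emb1, f_purchase))  # True
print(probe(emb2, f_purchase))  # False
```

The threshold of 0.5 here is arbitrary; as noted above, you'd want to test candidate thresholds against labeled examples before trusting the classifications.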
One issue with this style of model is you may have a sentence like "customer looked at items for purchase. no purchase made". If you're doing average pooling, the repeated purchase-related tokens can dominate the pooled embedding, so it aligns too well with the purchase feature vector despite the negation. So if you go this route you'd want to be careful with the type of model you're using for inference.
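You can see this failure mode with a toy example. The token vectors below are made up purely for illustration: "purchase" appears twice in the negated sentence, and because mean pooling just averages token vectors, the pooled embedding still sits close to the purchase feature vector even though no purchase happened:

```python
import numpy as np

# Toy 2-d token vectors (hypothetical, for illustration only):
tok = {
    "purchase": np.array([1.0, 0.0]),
    "no":       np.array([0.0, 1.0]),
    "customer": np.array([0.1, 0.1]),
    "looked":   np.array([0.1, 0.1]),
    "made":     np.array([0.1, 0.1]),
}

def mean_pool(tokens: list[str]) -> np.ndarray:
    """Average the token vectors -- order and negation are lost."""
    return np.mean([tok[t] for t in tokens], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

f_purchase = np.array([1.0, 0.0])
sent = ["customer", "looked", "purchase", "no", "purchase", "made"]

# The two "purchase" tokens dominate the average, so similarity to the
# purchase feature stays high despite the "no" in the sentence.
print(cosine(mean_pool(sent), f_purchase))
```

Running this gives a similarity well above any reasonable threshold, which is exactly the false positive described above. A model that encodes word order and negation (e.g. using the [CLS] token of a full transformer rather than a bag-of-tokens average) is less prone to this, which is why the choice of inference model matters here.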