
I have a random forest model that tries to predict what kind of useful activity a machine is doing based on its power readings. A single reading has 5 features.

There are two main types of activity: main (a set of useful activities; there are 6 such activities) and idle (the machine is not doing a useful activity).

Using an if-else rule, I first determine whether a reading was generated by a main activity or by idling. If the reading is from a main activity, I then use the random forest model, which was trained only on readings from main activities, to identify which main activity it is.
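
For concreteness, here is a minimal sketch of that two-stage pipeline in Python; the threshold rule, the dummy data, and the scikit-learn settings are my own placeholder assumptions, not my exact setup:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    IDLE_THRESHOLD = 10.0  # hypothetical power cutoff separating idle from main

    def is_main_activity(reading):
        # stand-in for the if-else rule: treat the reading as "main"
        # when its mean power exceeds the idle threshold
        return np.mean(reading) > IDLE_THRESHOLD

    # placeholder data: 600 readings of 5 features, labels 1..6 for main activities
    rng = np.random.default_rng(0)
    X_main = rng.uniform(15.0, 40.0, size=(600, 5))
    y_main = rng.integers(1, 7, size=600)

    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_main, y_main)

    def classify(reading):
        if not is_main_activity(reading):
            return "idle"
        return rf.predict(reading.reshape(1, -1))[0]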

I first trained the model on the raw readings of main activities (Table 1 in the figure), but it performed poorly on the test data.

Next, I divided every reading in the training dataset (which contains only readings from main activities) by the mean of the idle readings (call this the idle-mean reading) and trained the random forest on that (Table 2). I divided the test data by the idle-mean reading as well. This also did not work well.
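
In code, this second attempt is just an element-wise division by the per-feature idle mean (continuing the sketch above; idle_readings is a placeholder for my idle data):

    # placeholder idle data: 200 readings of 5 features
    idle_readings = rng.normal(5.0, 1.0, size=(200, 5))

    # the "idle-mean reading": per-feature mean of the idle readings, shape (5,)
    idle_mean = idle_readings.mean(axis=0)

    # Table 2: each main-activity reading as a ratio to the idle baseline
    X_main_ratio = X_main / idle_mean

Worth noting: dividing a feature by a positive constant is a monotone rescaling, and a decision tree's split thresholds simply rescale with it, so by itself this transform would not be expected to change a random forest much, which is consistent with it not helping.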

Finally, I applied a logarithmic transform to the training and test data: each value in the dataset is replaced by log_A(B), where the base A is the idle-mean reading and the argument B is the original reading (Table 3).
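
By the change-of-base identity, log_A(B) = ln(B) / ln(A), so the Table 3 transform is one line of NumPy (continuing the variables above; this assumes all readings and idle means are positive, since the log is undefined otherwise):

    # Table 3: replace each value B with log base idle_mean of B,
    # i.e. ln(B) / ln(idle_mean), computed per feature
    X_main_log = np.log(X_main) / np.log(idle_mean)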

This yielded very good results, which held up when I tested it repeatedly against several test datasets.

My question: can someone please explain why this works? My conceptual knowledge of ML is not great, so I'm having a hard time explaining it.

Nht_e0
1 Answer


Is your data very sparse?

Usually, the log function is very useful for reducing data variability.
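
For example, a quick illustration of that effect on right-skewed data (made-up numbers):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # heavily right-skewed

    print(x.std() / x.mean())                  # relative spread of raw data: large (~1.3)
    print(np.log(x).std() / np.log(x).mean())  # after log: much smaller (~0.33)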

On the other hand, do you apply the inverse of the log to check whether the final result is correct?

Very often, log-transformed results seem better because the data is easier to organize, but if we apply the inverse of the log to recover the real values, the results might turn out to be wrong.
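
For a feature transform like the one in the question, the inverse is simple exponentiation, B = A**x, so the check is cheap (illustrative values only):

    import numpy as np

    A = 5.2                    # hypothetical idle-mean value for one feature
    B = 31.7                   # hypothetical raw reading
    x = np.log(B) / np.log(A)  # forward transform: log base A of B
    print(A ** x)              # inverse transform: recovers 31.7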

The poor result on the raw data could also be due to a bad train/test split, for example one made without shuffling.
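
If the split was done by slicing the data in order, scikit-learn's train_test_split gives a shuffled (and optionally stratified) split; a sketch with placeholder data:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(15.0, 40.0, size=(600, 5))  # placeholder readings
    y = rng.integers(1, 7, size=600)            # placeholder labels, 6 classes

    # shuffle=True randomizes the order; stratify keeps the class
    # proportions the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
    )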

I don't know what machines you are using, but generally speaking, machines depend on sensors that are not all calibrated exactly the same, and those calibration differences could explain some wrong results. Best practice suggests learning on variations (i.e., +0.3 or -2.8) instead of raw values, because the calibration biases then cancel out. Have you tried this option?

Example:

Raw values:

32.5 35.3 32.2 25.6

Variations:

+2.8 -3.1 -6.6
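
In NumPy, those successive variations are just first differences:

    import numpy as np

    raw = np.array([32.5, 35.3, 32.2, 25.6])
    print(np.diff(raw))  # ≈ [ 2.8, -3.1, -6.6]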

Nicolas Martin