
I am using Isolation Forests for Anomaly Detection. Say, my set has 10 variables, var1, var2, ..., var10, and I found an anomaly. Can I rank the 10 variables var1, var2, ..., var10 in such a way I can say that I have an anomaly and the main reason is, say, var6.

For example, if I had var1, var2, var3 only, and my set were:

5   25   109
7   26   111
6   23   108
6   26   109
6   978  108
5   25   110
7   24   107

I would say that 6, 978, 108 is an anomaly and, in particular, the reason is var2.
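For concreteness, here is a minimal sketch (my own, not from the question) of running scikit-learn's IsolationForest on the toy data above; the `contamination` value is an assumption chosen to flag roughly one row:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# The 7-row toy dataset from the question (var1, var2, var3).
X = np.array([
    [5, 25, 109],
    [7, 26, 111],
    [6, 23, 108],
    [6, 26, 109],
    [6, 978, 108],  # the suspected anomaly
    [5, 25, 110],
    [7, 24, 107],
])

# contamination=0.15 is an assumed setting (~1 anomaly out of 7 rows).
clf = IsolationForest(contamination=0.15, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
print(labels)
```

This flags the row, but on its own says nothing about *which* variable is responsible, which is exactly the question.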

Is there a way to determine the main reason why a particular entry is an anomaly?

Palak Bansal

2 Answers


A naive approach would be to use a supervised model to predict the target (anomaly vs. no anomaly) that your IsolationForest model outputs. Then, if and only if this supervised binary classification model performs well (you can check with a cross-validation score), you can use your favorite feature-importance tool to examine the impact/contribution of each feature:

  1. Mean decrease in impurity if your model is tree-based (it is also useful to plot a tree of your model to understand the rules that make an observation an outlier).
  2. Permutation importance for a model-agnostic approach (with a predefined metric).
  3. SHAP values to know more precisely the influence of each feature on your target (anomaly / no anomaly).

Edit:

I just did some research and found that the SHAP library has support for Isolation Forest.

Multivac

For a while now, one can use SHAP to explain scikit-learn Isolation Forest models. Example code and output are in this answer.

Jon Nordby