
I am using an isolation forest model from the scikit-learn library for anomaly detection in a time series dataset, where each point in the dataset is a data frame. However, I have additional knowledge of specific criteria that can identify non-anomalous data points. How can I integrate this knowledge with the existing model to improve its accuracy?
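For context, here is a minimal sketch of my current setup (the random data, shapes, and parameter values below are placeholders, not my real values):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder batch: rows are samples, columns are features.
X = np.random.default_rng(0).normal(size=(900, 4))

# A fresh forest is fit on each batch, so every batch is judged
# only on its own distribution.
clf = IsolationForest(n_estimators=100, random_state=0)
clf.fit(X)

scores = clf.decision_function(X)  # higher = more normal
labels = clf.predict(X)            # 1 = inlier, -1 = anomaly
```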

Before answering, please consider that I am a newbie to data science, so pardon any flaws. I know I can use supervised methods or hyperparameter optimization on top of anomaly detection to incorporate this knowledge, but I feel that would destroy the advantage anomaly detection has over supervised learning, namely treating each batch independently of the others and making decisions based on that batch's information alone. I chose anomaly detection over a supervised learning method because the data may evolve while the rate of anomaly occurrence stays nearly constant across batches, and anomaly detection, being dynamic, can handle such situations. If I put a supervised learning method on top of anomaly detection, this novelty would be lost: a dense layer would consider past data and could lead to a positive feedback loop of errors.

I am trying to find a method to improve the anomaly results such that it can guide the original algorithm based on my criteria: points that are closer to the 'good rows' should be less likely to be flagged as anomalies, while the original anomaly score (without this consideration) still retains some weight. This method should rely solely on the current batch's knowledge.
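To make the idea concrete, here is a rough sketch of the kind of blending I have in mind; `blended_scores`, `good_mask`, and `alpha` are names and values I made up for illustration, not an existing API:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import pairwise_distances

def blended_scores(X, good_mask, alpha=0.7):
    """Blend the raw isolation-forest score with closeness to known-good
    rows; alpha is the weight the original score retains."""
    raw = IsolationForest(random_state=0).fit(X).decision_function(X)
    # Rescale the raw score to [0, 1] (higher = more normal).
    raw01 = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)

    # Closeness of every point to its nearest known-good row, also in [0, 1].
    d = pairwise_distances(X, X[good_mask]).min(axis=1)
    closeness = 1.0 - (d - d.min()) / (d.max() - d.min() + 1e-12)

    # Points near good rows are pulled toward "normal", but the original
    # score still carries weight alpha, using only this batch's data.
    return alpha * raw01 + (1 - alpha) * closeness
```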

I understand that my approach may be flawed, and I welcome any corrections and suggestions.

Edit: Based on the comments, I'll try to briefly explain the implementation and usage. I have a process that keeps generating new data at a non-constant rate (one of the reasons for opting for anomaly detection). As I accumulate anomaly results, I don't want to find all anomalies at once; rather, if the detector flags an anomaly, it should really be one (i.e., high precision). The rate is approximately two anomalies per 900 points. I poll and wait until the process has generated at least 900 data samples and then feed them to the isolation forest. This gives a distance-based score, and I currently use a static threshold to flag anomaly points.

Now, among those samples there are some that I am sure are not anomalies. There are also other samples that I am sure are not anomalies, but I think if the anomaly detection learns from them it might miss some actual anomalies, so I want them to only subtly influence the original anomaly algorithm. Also, I don't want the method to be static: some data samples may be good right now, yet similar samples could rightly be classified as anomalies based on another time slot's data. This is why I am avoiding hyperparameter optimization or a DNN on top of anomaly detection using supervised data.
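In code, the per-batch step currently looks roughly like this (the threshold value and the `detect_batch` name are placeholders; the polling/accumulation loop around it is omitted):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

THRESHOLD = -0.05  # placeholder for my static cut-off

def detect_batch(X, threshold=THRESHOLD):
    """Score one accumulated batch (~900 samples) and return the indices
    flagged as anomalies under a static threshold on the score."""
    clf = IsolationForest(random_state=0).fit(X)
    scores = clf.decision_function(X)  # lower = more anomalous
    return np.where(scores < threshold)[0], scores

# Called once the process has generated at least 900 new samples.
anomalies, scores = detect_batch(np.random.default_rng(0).normal(size=(900, 4)))
```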

Another question I have: would it be feasible to apply the 'good point' criteria before fitting the data frame? I think in that case the information would be lost, so I haven't done it yet.
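For example, "before fitting" could mean either dropping the known-good rows or down-weighting them via `sample_weight` (which `IsolationForest.fit` accepts); the mask and the 0.2 weight below are made-up placeholders:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(900, 4))
good_mask = rng.random(900) < 0.1  # placeholder for my 'good point' criteria

# Option A: drop known-good rows before fitting, then score everything.
clf_a = IsolationForest(random_state=0).fit(X[~good_mask])
scores_a = clf_a.decision_function(X)

# Option B: keep all rows but down-weight the known-good ones, so the
# forest still sees their region of feature space.
weights = np.where(good_mask, 0.2, 1.0)
clf_b = IsolationForest(random_state=0).fit(X, sample_weight=weights)
scores_b = clf_b.decision_function(X)
```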

SUNITA GUPTA

0 Answers