Questions tagged [outlier]

For questions regarding outliers or unusual points in the data.

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset. A discomfiting possibility is that such observations come from a different population than the one intended to be studied.

Outliers are not necessarily bad or wrong, nor do they need to be removed from the data before further analysis. They do, however, indicate that some observations (there can be more than one outlier in any dataset) at least appear to differ from the bulk of the dataset, which suggests they should be individually examined and understood. In addition, some statistical procedures are sensitive to outliers: removing one or more outliers could substantially change the conclusions of those procedures.
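
As a minimal sketch of that sensitivity, assuming a small made-up sample (the values below are hypothetical and not taken from any question on this page), the mean shifts sharply when one extreme value is dropped, while the median does not move:

```python
from statistics import mean, median

# Hypothetical sample: five typical readings plus one extreme value.
sample = [10, 12, 11, 13, 12, 95]

# The same data with the extreme value dropped.
trimmed = [x for x in sample if x != 95]

# The mean is pulled strongly toward the extreme value...
print("mean:  ", mean(sample), "->", mean(trimmed))

# ...while the median is identical with and without it.
print("median:", median(sample), "->", median(trimmed))
```

A procedure built on the mean would therefore reach a different conclusion depending on whether the extreme value is kept, whereas one built on the median would not.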

228 questions
16 votes
3 answers

How to remove outliers using box-plot?

I have data for a metric grouped by date. I have plotted the data; now, how do I remove the values outside the range of the boxplot (outliers)? All the ['AVG'] data is in a single column, and I need it for time series modelling.
Uday T
  • 362
  • 1
  • 5
  • 11
13 votes
4 answers

What is the difference between outlier detection and anomaly detection?

I would like to know the difference in terms of applications (e.g. which one is credit card fraud detection?) and in terms of the techniques used. Example papers which define the task would be welcome.
Martin Thoma
  • 19,540
  • 36
  • 98
  • 170
11 votes
2 answers

Tools for automatic anomaly detection on a SQL table?

I have a large SQL table that is essentially a log. The data is pretty complex and I'm trying to find some way to identify anomalies without me understanding all the data. I've found lots of tools for Anomaly Detection but most of them require a…
THE JOATMON
  • 211
  • 2
  • 4
10 votes
3 answers

Isolation forest sklearn contamination param

I am working on an unsupervised anomaly detection task on time series data using an isolation forest algorithm. I am developing it in Python, specifically with scikit-learn. I have found a lot of examples, but what is not very clear is how to…
10 votes
4 answers

Gas consumption outliers detection - Neural network project. Bad results

I tried to detect outliers in the gas consumption of some Dutch buildings by building a neural network model. I get very bad results, but I can't find the reason. I am not an expert, so I would like to ask what I can improve and what I'm…
marcodena
  • 1,667
  • 4
  • 14
  • 17
10 votes
2 answers

Scalable Outlier/Anomaly Detection

I am trying to set up a big data infrastructure using Hadoop, Hive, Elasticsearch (amongst others), and I would like to run some algorithms over certain datasets. I would like the algorithms themselves to be scalable, so this excludes using tools…
doublebyte
  • 430
  • 3
  • 9
10 votes
1 answer

Difference: Replicator Neural Network vs. Autoencoder

I'm currently studying papers about outlier detection using RNNs (Replicator Neural Networks) and wonder what the particular difference from autoencoders is. RNNs seem to be treated by many as the holy grail of outlier/anomaly detection, however…
Nex
  • 285
  • 2
  • 6
9 votes
3 answers

In elbow curve how to find the point from where the curve starts to rise?

I am computing a distance metric on my data. The results are then sorted in ascending order. Samples with a distance greater than a specific threshold are to be marked as outliers and discarded. Below is a plot of all distance…
Faiz Kidwai
  • 235
  • 1
  • 2
  • 12
9 votes
3 answers

Regression model R2 drops when I remove outliers: is that even possible?

I'm analyzing how outliers in my dataset of size 8x8000 affect regression models. I have three scenarios: raw dataset (with outliers), Winsorized dataset (2% of the extreme outliers adjusted), and dataset without outliers (rows with outliers…
ml.freak
  • 103
  • 4
8 votes
3 answers

Which algorithms or methods can be used to detect an outlier from this data set?

Suppose I have a data set: amounts of money (100, 50, 150, 200, 35, 60, 50, 20, 500). I have searched the web for techniques that can be used to find a possible outlier in this data set, but I ended up confused. My question is: Which…
CN1002
  • 243
  • 2
  • 7
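
For the nine amounts quoted in the question above, one common starting point is Tukey's 1.5 × IQR rule, the rule commonly used to flag points on a box plot. The sketch below is only an illustration, not an answer taken from the thread: the helper name `iqr_outliers`, the multiplier `k=1.5`, and the "inclusive" quartile method are all assumptions, and other quartile conventions can give different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is Tukey's conventional choice.
    """
    # Quartiles computed with the "inclusive" method (an assumption;
    # other quartile definitions may move the fences slightly).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# The amounts listed in the question.
amounts = [100, 50, 150, 200, 35, 60, 50, 20, 500]
flagged = iqr_outliers(amounts)
print(flagged)
```

Under these assumptions Q1 = 50 and Q3 = 150, so the fences sit at -100 and 300 and only 500 is flagged. With just nine points, a flag like this is a prompt to examine the value, not proof that it is wrong.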
7 votes
1 answer

How to decide how many n_neighbors to consider while implementing LocalOutlierFactor?

I have a data set with 134000 rows and 200 columns. I am trying to identify the outliers in the data set using LocalOutlierFactor from scikit-learn. Although I understand how the algorithm works, I am unable to decide n_neighbors for my data…
7 votes
3 answers

Which outlier detection can detect these outliers?

I have a vector and want to detect outliers in it. The following figure shows the distribution of the vector. Red points are outliers. Blue points are normal points. Yellow points are also normal. I need an outlier detection method (a…
6 votes
2 answers

How to scale outputs from AutoEncoder from multiple models?

I have a problem for which I have not been able to find any answers in my search so far. BACKGROUND I am working on an anomaly detection problem on machines utilising an auto-encoder. I am building a model file per machine because the machines'…
6 votes
2 answers

Effect of outliers on Naive Bayes

Are Naive Bayes algorithms affected by outliers in the data? Given a data set, does one need to remove outliers before applying Naive Bayes?
James Smith
  • 61
  • 1
  • 2
6 votes
4 answers

Handling outliers and Null values in Decision tree

Outliers: As I understand it, decision trees are robust to outliers. Can anybody please confirm with an example whether my hypothesis is right? (What if I have a feature ranging from 0 to 9 but there is an outlier whose value is 10000?) Whether it…
deepguy
  • 1,471
  • 8
  • 21
  • 39