Suppose I have a data set of amounts of money: (100, 50, 150, 200, 35, 60, 50, 20, 500). I have searched the web for techniques to find possible outliers in this data set, but I ended up confused.

My question is: Which algorithms, techniques or methods can be used to detect possible outliers in this data set?

PS: Assume that the data does not follow a normal distribution. Thanks.

CN1002

3 Answers


You can use a box plot for outlier analysis. I'll show you how to do that in Python:

Consider your data as an array:

a = [100, 50, 150, 200, 35, 60, 50, 20, 500]

Now, use seaborn to plot the boxplot:

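import matplotlib.pyplot as plt
import seaborn as sn

sn.boxplot(x=a)  # horizontal box plot of the values in a
plt.show()       # display the figure when running as a script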

So, you would get a plot which looks somewhat like this:

[Image: box plot of the data]

To me, 500 looks like the only outlier. However, that depends on the problem statement and on the tolerance level of the analyst or statistician.

You can have a look at one of my answers on the CrossValidated SE for more tests.

And there are several nice questions on outliers and the algorithms and techniques for detecting them.

My personal favourite is the Mahalanobis distance technique.
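To make that concrete for this data set: with a single variable, the Mahalanobis distance reduces to the absolute deviation from the mean divided by the standard deviation. The sketch below assumes the sample standard deviation and uses only the standard library; the names `distances` and `most_extreme` are just for illustration. Note that the mean and standard deviation are themselves pulled by an extreme value, so treat the distances as a rough ranking rather than a formal test.

```python
from statistics import mean, stdev

data = [100, 50, 150, 200, 35, 60, 50, 20, 500]

mu = mean(data)
sigma = stdev(data)  # sample standard deviation (an assumption; pstdev is also possible)

# One-dimensional Mahalanobis distance of each value from the mean
distances = [abs(x - mu) / sigma for x in data]

# Print the values from most to least distant
for x, d in sorted(zip(data, distances), key=lambda pair: pair[1], reverse=True):
    print(f"{x}: {d:.2f}")

# The value with the largest distance is the strongest outlier candidate
most_extreme = data[distances.index(max(distances))]
```

Here 500 gets the largest distance by a wide margin. Which distance counts as "too far" is left to the analyst.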

Dawny33

One way of thinking of outlier detection is that you're creating a predictive model, then you're checking to see if a point falls within the range of predictions. From an information-theoretic point of view, you can see how much each observation increases the entropy of your model.

If you are treating this data as just a collection of numbers, and you don't have some proposed model for how they're generated, you might as well just look at the average. If you're certain the numbers aren't normally distributed, you can't make statements as to how far 'off' a given number is from the average, but you can just look at it in absolute terms.

Applying this, you can take the average of all the numbers, then exclude each number in turn and take the average of the others. Whichever leave-one-out average differs most from the global average marks the biggest outlier. Here's some Python:

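def avg(a):
    # Floor division reproduces the integer averages (Python 2 behaviour)
    # behind the output below; use / instead for exact floating-point scores.
    return sum(a) // len(a)

l = [100, 50, 150, 200, 35, 60, 50, 20, 500]
m = avg(l)  # average of the full data set
for idx in range(len(l)):
    # average of all values except the one at position idx
    rest = avg([elem for i, elem in enumerate(l) if i != idx])
    print("outlier score of {0}: {1}".format(l[idx], abs(m - rest)))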
>>
outlier score of 100: 4
outlier score of 50: 10
outlier score of 150: 3
outlier score of 200: 9
outlier score of 35: 12
outlier score of 60: 9
outlier score of 50: 10
outlier score of 20: 14
outlier score of 500: 46 
Tristan Reid

A simple approach is to apply the same rule that box plots use: any value more than 1.5 × IQR below Q1 or above Q3 is an outlier, where IQR = Q3 − Q1 is the interquartile range.

I find it useful in lots of cases, even if it is not perfect and may be too simple.

It has the advantage of not assuming normality.
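As a rough sketch of the usual box-plot (Tukey) fences, Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, applied to the data from the question: the quartile method ("inclusive", i.e. linear interpolation over the data range) is my assumption, and other quartile conventions give slightly different fences. The multiplier 1.5 is the conventional choice, not a fixed law.

```python
from statistics import quantiles

data = [100, 50, 150, 200, 35, 60, 50, 20, 500]

# Quartiles with linear interpolation over the data range ("inclusive" method)
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Values beyond these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

With these quartiles (Q1 = 50, Q3 = 150) the fences are -100 and 300, so only 500 is flagged.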

Michael Hooreman