Problem

I tried Naive Bayes on a labeled data set of crime data but got really poor results (7% accuracy). Naive Bayes runs much faster than the other algorithms I've been using, so I wanted to find out why the score was so low.

Research

After some reading, I found that Naive Bayes should be used with balanced datasets because it is biased toward classes with higher frequency. Since my data is unbalanced, I wanted to try Complement Naive Bayes, which is specifically designed to deal with skewed data. In the paper that describes the method, the application is text classification, but I don't see why the technique wouldn't work in other situations. You can find the paper I'm referring to here. In short, the idea is to compute each class's weights from the occurrences in the data where that class doesn't show up.

After doing some research I was able to find an implementation in Java, but unfortunately I don't know any Java and I don't understand the algorithm well enough to implement it myself.

Question

Where can I find an implementation in Python? If one doesn't exist, how should I go about implementing it myself?

grasshopper

2 Answers

Naive Bayes should be able to handle imbalanced datasets. Recall that the Bayes formula is

$$P(y \mid x) = \cfrac{P(x \mid y) \, P(y)}{P(x)} \propto P(x \mid y) \, P(y)$$

So $P(x \mid y) \, P(y)$ takes the prior $P(y)$ into account.
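
To make the role of the prior concrete, here is a minimal sketch with made-up numbers (the priors and likelihoods below are hypothetical, not taken from the crime data). It only illustrates that the frequent class can win even when the likelihood favors the rare one:

```python
import numpy as np

# Hypothetical two-class problem; these numbers are assumptions for illustration.
prior = np.array([0.9, 0.1])        # P(y): class 0 is much more frequent
likelihood = np.array([0.2, 0.6])   # P(x | y) for one observed x

unnormalized = likelihood * prior              # P(x | y) P(y)
posterior = unnormalized / unnormalized.sum()  # dividing by P(x) normalizes
predicted = int(np.argmax(posterior))          # class with the largest posterior
```

Here the likelihood is three times higher for class 1, but the prior outweighs it, so class 0 is predicted.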

In your case, maybe you overfit and need some smoothing? You can start with +1 smoothing and see if it gives any improvement. In Python, using numpy, I'd implement the smoothing this way:

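import numpy as np
table = np.array([[3, 0, 1], [2, 2, 0]])  # placeholder counts for each feature; replace with your own
PT = (table + 1) / (table + 1).sum(axis=1, keepdims=True)  # +1 smoothing, each row normalized to sum to 1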

Note that this gives you Multinomial Naive Bayes, which applies only to discrete (categorical or count) data.

I can also suggest the following link: http://www.itshared.org/2015/03/naive-bayes-on-apache-flink.html. It's about implementing Naive Bayes on Apache Flink. While it's Java, maybe it'll give you some theory you need to understand the algorithm better.

Alexey Grigorev

My implementation of Complement Naive Bayes was merged into scikit-learn and can be found here.
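
For reference, a minimal sketch of how this might look in practice. `ComplementNB` from `sklearn.naive_bayes` is the scikit-learn class; the synthetic count data and the helper `complement_nb_weights` are hypothetical and exist only for illustration. The helper is a simplified take on the core idea (weights estimated from the counts of all other classes, with no weight normalization or class prior), not the library's actual code:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB


def complement_nb_weights(X, y, alpha=1.0):
    """Hypothetical helper: per-class weights from counts in all OTHER classes."""
    classes = np.unique(y)
    # Feature counts summed per class: shape (n_classes, n_features)
    class_counts = np.array([X[y == c].sum(axis=0) for c in classes])
    # Complement counts: everything NOT in class c, plus additive smoothing
    comp_counts = class_counts.sum(axis=0) - class_counts + alpha
    theta = comp_counts / comp_counts.sum(axis=1, keepdims=True)
    # A feature that is common outside class c counts as evidence against c
    return classes, -np.log(theta)


# Synthetic, imbalanced count data (assumption: 90 vs 10 samples, 3 features)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.poisson(lam=[5, 1, 1], size=(90, 3)),  # majority class 0
    rng.poisson(lam=[1, 1, 5], size=(10, 3)),  # minority class 1
])
y = np.array([0] * 90 + [1] * 10)
X_new = np.array([[6, 1, 0], [0, 1, 6]])

# Library version
clf = ComplementNB(alpha=1.0).fit(X, y)
pred_sklearn = clf.predict(X_new)

# Hand-rolled sketch: pick the class with the highest weighted score
classes, weights = complement_nb_weights(X, y, alpha=1.0)
pred_manual = classes[np.argmax(X_new @ weights.T, axis=1)]
```

Both versions expect non-negative feature values (counts or similar), so continuous features would need to be binned or otherwise transformed first.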

airalcorn2