
The task I usually work on is the following. A client comes to me with a set of their existing customers (called positive companies) and wants me to find other, similar prospects. Usually the client also gives me a set of negative companies, and I have a large set of candidate companies (which I call the basket).

I perform this task by training an AdaBoost classifier on the positives and negatives. I then run this classifier on the basket. Each company in the basket receives a score, and the highest scores indicate the most promising prospects for the client.
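For reference, the workflow described above could look roughly like the following with scikit-learn. This is only a sketch: the feature vectors are synthetic stand-ins, and the names `positives`, `negatives` and `basket` are illustrative, not taken from any real pipeline.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for company feature vectors (hypothetical data).
positives = rng.normal(loc=2.0, scale=1.0, size=(50, 4))
negatives = rng.normal(loc=-2.0, scale=1.0, size=(50, 4))
basket = np.vstack([
    rng.normal(loc=2.0, scale=1.0, size=(20, 4)),   # positive-like companies
    rng.normal(loc=-2.0, scale=1.0, size=(80, 4)),  # negative-like companies
])

# Supervised training set: positives labeled 1, negatives labeled 0.
X_train = np.vstack([positives, negatives])
y_train = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Score each basket company by the estimated probability of the positive class.
scores = clf.predict_proba(basket)[:, 1]

# Basket indices sorted from most to least promising.
ranking = np.argsort(-scores)
```

The first entries of `ranking` are the prospects to show the client first.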

Now a new client has no set of negatives to give me, and I'm a bit lost. Obviously, I can no longer use supervised learning as before. I first thought of running a k-nearest-neighbours search around each positive, which would give me a list of "close" prospects. The problem is that I would no longer have a score. Furthermore, the k-nearest-neighbours method requires me to define a distance, which I don't like because I don't want to assign subjective weights to the features. The AdaBoost classifier, by contrast, learns its own weights and so determines by itself which features are important.

Could someone suggest how I could tackle this problem?

Dust009

1 Answer


To summarize, you have labeled data from one class (positive) plus unlabeled data, and you want to find the positive examples in the unlabeled data. The general name for this setting in machine learning is one-class classification, which is a fairly broad field.
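As one illustration of the one-class setting, a model can be fitted on the positives alone and then used to score the basket. The sketch below uses scikit-learn's `OneClassSVM`, which is just one possible choice and an assumption on my part. The data is synthetic, and the `nu` value is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for company feature vectors (hypothetical data).
positives = rng.normal(loc=2.0, scale=1.0, size=(100, 4))
basket = np.vstack([
    rng.normal(loc=2.0, scale=1.0, size=(20, 4)),   # positive-like companies
    rng.normal(loc=-2.0, scale=1.0, size=(80, 4)),  # unrelated companies
])

# Standardize features using statistics of the positives only.
scaler = StandardScaler().fit(positives)

# Fit the one-class model on the positives; no negatives are needed.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(scaler.transform(positives))

# Higher decision values mean "more similar to the positives".
scores = model.decision_function(scaler.transform(basket))
ranking = np.argsort(-scores)
```

Note that this variant ignores the basket during training, so it still depends on how the features are scaled, which is close to the distance concern raised in the question.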

A sub-area that is particularly relevant is positive-unlabeled learning, which is the problem of training a classifier when one has just positive and unlabeled data.
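To make this concrete, here is a minimal sketch of one well-known positive-unlabeled recipe, the approach of Elkan and Noto (2008). It is not necessarily the method the linked answer has in mind. It assumes the labeled positives are a random sample of all positives. The data is synthetic, and `LogisticRegression` is swapped in because the method relies on reasonably calibrated probabilities; any probabilistic classifier could be tried instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: known positives, plus a basket containing hidden positives.
positives = rng.normal(loc=2.0, scale=1.0, size=(100, 4))
basket = np.vstack([
    rng.normal(loc=2.0, scale=1.0, size=(40, 4)),    # hidden positives
    rng.normal(loc=-2.0, scale=1.0, size=(160, 4)),  # true negatives
])

# Step 1: train a "labeled vs. unlabeled" classifier, holding out some positives.
pos_train, pos_held = train_test_split(positives, test_size=0.25, random_state=0)
X = np.vstack([pos_train, basket])
s = np.concatenate([np.ones(len(pos_train)), np.zeros(len(basket))])
g = LogisticRegression(max_iter=1000).fit(X, s)

# Step 2: estimate c = P(labeled | positive) on the held-out positives.
c = g.predict_proba(pos_held)[:, 1].mean()

# Step 3: rescale to approximate P(positive | x), clipped to [0, 1].
scores = np.clip(g.predict_proba(basket)[:, 1] / c, 0.0, 1.0)
ranking = np.argsort(-scores)
```

If only a ranking is needed, the rescaling by `c` does not change the order, so the raw "labeled vs. unlabeled" probabilities already rank the basket. That means a boosted classifier could be used in step 1 as well.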

Also note that all the examples that need to be predicted are available at training time, so you can use a transductive learning algorithm. In particular, if you have a notion of which companies are similar, you could construct a graph in which similar companies are connected by edges. You could then run a graph propagation algorithm that assigns scores to the unlabeled items.
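A rough sketch of the graph idea, using only NumPy: positives act as seeds, and their score is spread along the edges of a similarity graph. The similarity used here (an RBF kernel on Euclidean distance with `sigma = 1.0`) and the damping factor `alpha = 0.9` are assumptions chosen for illustration. In practice the graph would come from whatever notion of company similarity is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: known positives followed by the basket.
positives = rng.normal(loc=2.0, scale=1.0, size=(30, 4))
basket = np.vstack([
    rng.normal(loc=2.0, scale=1.0, size=(20, 4)),   # positive-like companies
    rng.normal(loc=-2.0, scale=1.0, size=(80, 4)),  # unrelated companies
])
X = np.vstack([positives, basket])
n_pos = len(positives)

# Edge weights: RBF similarity on Euclidean distance (an assumed similarity).
sigma = 1.0
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dist / (2.0 * sigma ** 2))
np.fill_diagonal(W, 0.0)  # no self-loops

# Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))

# Seed vector: 1 for known positives, 0 for basket companies.
y = np.zeros(len(X))
y[:n_pos] = 1.0

# Iterate f <- alpha * S f + (1 - alpha) * y; this converges because alpha < 1.
alpha = 0.9
f = y.copy()
for _ in range(100):
    f = alpha * (S @ f) + (1.0 - alpha) * y

# Propagated scores for the basket, and the resulting ranking.
scores = f[n_pos:]
ranking = np.argsort(-scores)
```

Basket companies that are well connected to the seeds end up with higher scores, which gives the ranking that plain nearest-neighbour lookup lacks.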

Finally, here is a similar question where the answer suggests a method of positive-unlabeled learning.