I'm trying to train a PassiveAggressiveClassifier using TfidfVectorizer with the partial_fit technique in the script below:
Updated code:
import csv
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

a, ta = [], []   # training texts, testing texts
r, tr = [], []   # training labels, testing labels
g = []           # every label seen, used to build the class list

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()

with open('files') as f:
    for line in f:
        line = line.strip()

        # First pass over the CSV: collect all labels so partial_fit
        # can be given the complete set of classes.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                res = row['gau']
                if len(res) > 0:
                    g.append(int(res))   # keep labels as ints, matching training_result
        cls = np.unique(g)
        print(len(cls))

        # Second pass: accumulate rows and train on chunks of 400.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            i = 0
            for row in reader:
                arr = row['text']
                res = row['gau']
                a.append(arr)
                if len(res) > 0:
                    r.append(int(res))
                i = i + 1
                if i % 400 == 0:
                    training_set = vect.fit_transform(a)
                    print(training_set.shape)
                    training_result = np.array(r)
                    model = model.partial_fit(
                        training_set, training_result, classes=cls)
                    a, r, i = [], [], 0

print(model)

# ta / tr (the test texts and labels) are filled in the same way elsewhere.
testing_set = vect.transform(ta)
testing_result = np.array(tr)
predicted = model.predict(testing_set)
print("Result to be predicted:", testing_result)
print("Prediction:", predicted)
There are multiple CSV files, each containing 4k-5k records, and I am trying to fit 400 records at a time using partial_fit. When I ran this code I got output like the following, where the prediction does not match the expected label:
Result to be predicted: 1742
Prediction: 2617
How do I resolve this issue? The records in my CSV files are of variable length.
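To make the batching idea concrete, here is a stripped-down, self-contained sketch of the pattern I am trying to follow (the texts and labels are made up, the batch size is 2 instead of 400, and HashingVectorizer is used as in my updated code):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Made-up rows standing in for my CSV data (text + integer label).
texts = ["red apple", "green apple", "old car", "fast car",
         "ripe banana", "slow truck"]
labels = [1, 1, 2, 2, 1, 2]

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()
classes = np.unique(labels)        # the full set of classes, known up front

batch_size = 2                     # 400 in my real script
for start in range(0, len(texts), batch_size):
    batch_text = texts[start:start + batch_size]
    batch_y = np.array(labels[start:start + batch_size])
    X = vect.transform(batch_text)  # HashingVectorizer is stateless, so transform is enough
    model.partial_fit(X, batch_y, classes=classes)

print(model.predict(vect.transform(["shiny apple", "broken car"])))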
UPDATE:
After replacing TfidfVectorizer with HashingVectorizer I was able to build the model successfully, but when I run predictions on my test data, the predictions are all incorrect.
My training data consists of millions of lines spread across the CSV files, and each line contains at most 4k-5k words of text.
So is there any problem with my approach, i.e. can these algorithms be used with data like mine?
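To give an idea of the scale I am worried about, here is a tiny, self-contained check of how large the hashed feature row gets for a single long line (the document here is artificial, and n_features is left at its default of 2**20):

from sklearn.feature_extraction.text import HashingVectorizer

# One artificial "document" roughly the length of my longer CSV rows.
doc = " ".join("word%d" % (i % 500) for i in range(4000))

for ngrams in [(1, 1), (1, 4)]:
    vect = HashingVectorizer(ngram_range=ngrams)  # n_features defaults to 2**20
    X = vect.transform([doc])
    print(ngrams, "-> shape:", X.shape, "non-zero entries:", X.nnz)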