I'm trying to train a PassiveAggressiveClassifier using TfidfVectorizer with the partial_fit technique in the script below:
Updated code:
import csv
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

a, ta = [], []   # training texts, testing texts
r, tr = [], []   # training labels, testing labels
g = []           # every label seen, used to build the class list

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()

with open('files') as f:
    for line in f:
        line = line.strip()

        # First pass over the CSV: collect all labels so partial_fit
        # can be given the complete set of classes.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                res = row['gau']
                if len(res) > 0:
                    g.append(int(res))   # keep labels as ints, matching training_result
        cls = np.unique(g)
        print(len(cls))

        # Second pass: accumulate rows and train on chunks of 400.
        with open('gau-' + line + '.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            i = 0
            for row in reader:
                arr = row['text']
                res = row['gau']
                a.append(arr)
                if len(res) > 0:
                    r.append(int(res))
                i = i + 1
                if i % 400 == 0:
                    training_set = vect.fit_transform(a)
                    print(training_set.shape)
                    training_result = np.array(r)
                    model = model.partial_fit(
                        training_set, training_result, classes=cls)
                    a, r, i = [], [], 0

print(model)

# ta / tr (the test texts and labels) are filled in the same way elsewhere.
testing_set = vect.transform(ta)
testing_result = np.array(tr)
predicted = model.predict(testing_set)
print("Result to be predicted:", testing_result)
print("Prediction:", predicted)
There are multiple CSV files, each containing 4k-5k records, and I am trying to fit 400 records at a time using partial_fit. When I ran this code I got output like the following, where the prediction does not match the expected label:
Result to be predicted: 1742
Prediction: 2617
How do I resolve this issue? The records in my CSV files are of variable length.
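To make the batching idea concrete, here is a stripped-down, self-contained sketch of the pattern I am trying to follow (the texts and labels are made up, the batch size is 2 instead of 400, and HashingVectorizer is used as in my updated code):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Made-up rows standing in for my CSV data (text + integer label).
texts = ["red apple", "green apple", "old car", "fast car",
         "ripe banana", "slow truck"]
labels = [1, 1, 2, 2, 1, 2]

vect = HashingVectorizer(ngram_range=(1, 4))
model = PassiveAggressiveClassifier()
classes = np.unique(labels)        # the full set of classes, known up front

batch_size = 2                     # 400 in my real script
for start in range(0, len(texts), batch_size):
    batch_text = texts[start:start + batch_size]
    batch_y = np.array(labels[start:start + batch_size])
    X = vect.transform(batch_text)  # HashingVectorizer is stateless, so transform is enough
    model.partial_fit(X, batch_y, classes=classes)

print(model.predict(vect.transform(["shiny apple", "broken car"])))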
UPDATE:
After replacing TfidfVectorizer with HashingVectorizer I was able to build the model successfully, but when I run predictions on my test data, the predictions are all incorrect.
My training data consists of millions of lines spread across the CSV files, and each line contains at most 4k-5k words of text.
So is there any problem with my approach, i.e. can these algorithms be used with data like mine?
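To give an idea of the scale I am worried about, here is a tiny, self-contained check of how large the hashed feature row gets for a single long line (the document here is artificial, and n_features is left at its default of 2**20):

from sklearn.feature_extraction.text import HashingVectorizer

# One artificial "document" roughly the length of my longer CSV rows.
doc = " ".join("word%d" % (i % 500) for i in range(4000))

for ngrams in [(1, 1), (1, 4)]:
    vect = HashingVectorizer(ngram_range=ngrams)  # n_features defaults to 2**20
    X = vect.transform([doc])
    print(ngrams, "-> shape:", X.shape, "non-zero entries:", X.nnz)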