I'm training a classifier on the DAIGT dataset. The goal is to distinguish human-written from AI-generated text, so this is a binary classification problem. As a baseline before moving on to an LLM classifier, I'm using a pipeline of a TF-IDF vectorizer followed by a logistic regression classifier. However, when I evaluate it this way the metrics come out suspiciously high. For example, the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(daigt_v2["text"], daigt_v2["label"]):
    X_train, y_train = daigt_v2.iloc[train_idx]["text"], daigt_v2.iloc[train_idx]["label"]
    X_test, y_test = daigt_v2.iloc[test_idx]["text"], daigt_v2.iloc[test_idx]["label"]

    baseline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ])
    baseline.fit(X_train, y_train)
    y_pred = baseline.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=["Human", "AI"]))
gives the following output:
              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5475
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973
So every fold shows a 0.99 F1 score and 0.99 accuracy, which obviously seems way too high. However, when I try using cross_validate like this:
import numpy as np

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])
scores = cross_validate(baseline,
                        daigt_v2["text"],
                        daigt_v2["label"],
                        cv=10,
                        scoring=["accuracy",
                                 "f1",
                                 "recall",
                                 "precision",
                                 "roc_auc",
                                 "average_precision"])
summary = {key: float(np.mean(value)) for key, value in scores.items()}
then summary comes out as:
{'fit_time': 13.48662896156311,
'score_time': 5.418254947662353,
'test_accuracy': 0.8590308329341341,
'test_f1': 0.8367589483608666,
'test_recall': 0.9277524353897032,
'test_precision': 0.7674348038361346,
'test_roc_auc': 0.9595275583634191,
'test_average_precision': 0.9446004784576681}
Which are much more modest scores. Obviously I trust the second result more, but can anyone explain the discrepancy?
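One difference I've noticed between the two runs: my manual loop shuffles via StratifiedKFold(shuffle=True), whereas cross_validate with cv=10 and a classifier defaults to an unshuffled StratifiedKFold. If the rows are ordered by source/prompt, each unshuffled test fold is then a contiguous block whose vocabulary the model may never have seen in training. Here is a minimal synthetic sketch (not the DAIGT data; all tokens and sizes are made up) of how that difference alone can swing the scores:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic corpus ordered by "source": each of 5 blocks has its own
# vocabulary, and the AI-class signal tokens are block-specific too.
rng = np.random.default_rng(0)
texts, labels = [], []
for block in range(5):
    for label in (0, 1):
        for _ in range(100):
            words = [f"src{block}w{rng.integers(20)}" for _ in range(30)]
            if label == 1:
                # class signal only exists within the block's own vocabulary
                words += [f"src{block}ai{rng.integers(5)}"] * 3
            texts.append(" ".join(words))
            labels.append(label)
texts, labels = np.array(texts), np.array(labels)

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression())])

# cv=5 with integer labels -> StratifiedKFold WITHOUT shuffling:
# each test fold is a contiguous, unseen block of sources.
unshuffled = cross_validate(pipe, texts, labels, cv=5, scoring="accuracy")

# Same splitter but shuffled, as in the manual loop: every fold's
# training data covers all blocks, so the signal tokens are known.
shuffled = cross_validate(
    pipe, texts, labels,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)

print("unshuffled:", unshuffled["test_score"].mean())
print("shuffled:  ", shuffled["test_score"].mean())
```

On this toy data the shuffled splitter scores near-perfectly while the unshuffled one hovers near chance, because the unshuffled test folds contain only out-of-vocabulary tokens. I don't know whether DAIGT v2 is actually ordered that way, so this is a hypothesis to test rather than a confirmed explanation.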