I've been working on a phishing detection project as a training exercise. After cleaning the data, creating new features, scaling the non-binary ones, and training a random forest model, I got an F1 score of 0.999 on the test set. Given how high that is, I'm worried the model may be overfitting, even though I took steps to reduce that risk, such as removing highly correlated features (e.g., I kept only one of URLLength and NoOfLettersInURL because their Pearson correlation is > 0.9), and the dataset is fairly balanced (57% labeled 0, 43% labeled 1).
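For reference, this is roughly how I spotted the correlated pair before dropping one of them (a minimal sketch; data is the raw dataframe loaded in the code below, and 0.9 is the cutoff I used):

import pandas as pd

data = pd.read_csv('PhiUSIIL_Phishing_URL_Dataset.csv')
# Off-diagonal value is the Pearson correlation mentioned above (> 0.9)
print(data[['URLLength', 'NoOfLettersInURL']].corr(method='pearson'))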
As per the exercise requirements, I dropped specific features (URLSimilarityIndex, CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb) and applied a StandardScaler to the non-binary columns. I also did some additional feature engineering, including ratios such as self_reference_ratio and the character entropy of the URL.
Here’s my primary question:
Is there a way to confirm whether the high F1 score reflects genuine performance or overfitting? If it does turn out to be overfitting, what would you suggest changing: the algorithm, the dataset, or something else?
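One check I've been considering (but haven't run yet) is stratified cross-validation with the scaler fit inside a Pipeline, so the scaling statistics are learned only from the training folds. A minimal sketch, assuming X and y as built in the code below (for simplicity the scaler is applied to every column here, not just the non-binary ones):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
print("CV F1 scores:", scores, "mean:", scores.mean())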
Thank you!
Here is the link to the dataset: https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset
Here is the code:
import pandas as pd
import math
from collections import Counter
import numpy as np
import re
from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.utils import shuffle
from xgboost import XGBClassifier
# Load the dataset and drop the features excluded by the exercise
data = pd.read_csv('PhiUSIIL_Phishing_URL_Dataset.csv')
data.drop(columns=['URLSimilarityIndex', 'CharContinuationRate', 'URLTitleMatchScore', 'URLCharProb', 'TLDLegitimateProb'], inplace=True)
data1 = data.copy()
# Ratio of self-referencing links to all links on the page
def self_reference_ratio(row):
    total_refs = row['NoOfSelfRef'] + row['NoOfExternalRef']
    return row['NoOfSelfRef'] / total_refs if total_refs != 0 else 0

# Ratio of external links to all links on the page
def external_reference_ratio(row):
    total_refs = row['NoOfSelfRef'] + row['NoOfExternalRef']
    return row['NoOfExternalRef'] / total_refs if total_refs != 0 else 0

# Share of images among the page's image, CSS and JS resources
def image_to_resource_ratio(row):
    total_resources = row['NoOfImage'] + row['NoOfCSS'] + row['NoOfJS']
    return row['NoOfImage'] / total_resources if total_resources != 0 else 0

# Shannon entropy of the URL characters (scheme stripped)
def calculate_entropy(url):
    url_part = url.split('://')[-1]
    char_count = Counter(url_part)
    length = len(url_part)
    entropy = -sum((freq / length) * math.log2(freq / length) for freq in char_count.values())
    return entropy

# JavaScript tags per line of code
def calculate_js_usage_ratio(row):
    return row['NoOfJS'] / row['LineOfCode'] if row['LineOfCode'] != 0 else 0

# Flag URLs that appear to use a port other than 80/443
def detect_non_standard_port(url):
    match = re.search(r":(\d+)", url)
    if match:
        port = int(match.group(1))
        return 1 if port not in [80, 443] else 0
    return 0

# Flag domains containing a hyphen (prefix-suffix pattern)
def detect_prefix_suffix(domain):
    return 1 if '-' in domain else 0
data1['non_std_port'] = data1['URL'].apply(detect_non_standard_port)
data1['prefix_suffix'] = data1['Domain'].apply(detect_prefix_suffix)
data1[['non_std_port', 'prefix_suffix']].head()
data1['self_reference_ratio'] = data1.apply(self_reference_ratio, axis=1)
data1['external_reference_ratio'] = data1.apply(external_reference_ratio, axis=1)
data1['image_to_resource_ratio'] = data1.apply(image_to_resource_ratio, axis=1)
data1['entropy'] = data1['URL'].apply(calculate_entropy)
data1['js_usage_ratio'] = data1.apply(calculate_js_usage_ratio, axis=1)
# Drop identifier and free-text columns before modelling
data1.drop(columns=['FILENAME', 'URL', 'Domain', 'TLD', 'Title'], inplace=True)
# Columns with exactly two distinct values are treated as binary and left unscaled
binary_columns = [col for col in data1.columns if data1[col].nunique() == 2]
non_binary_columns = [col for col in data1.columns if col not in binary_columns + ['label']]
scaler = StandardScaler()
data1[non_binary_columns] = scaler.fit_transform(data1[non_binary_columns])
data1 = shuffle(data1, random_state=42)
pd.set_option('display.max_columns', None)
X = data1.drop(columns=['label'])
y = data1['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)