
I want to understand why, in this code, I get the following results:


# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
import lightgbm as lgb
import optuna
import mlflow
import mlflow.lightgbm

# Load Titanic dataset
titanic_data = pd.read_csv('titanic.csv')  # Assuming the dataset is stored in 'titanic.csv'

# Select specific features
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
titanic_data = titanic_data[selected_features]

# Convert categorical features to numerical using one-hot encoding
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)

# Extract features and target variable
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Split dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an initial LightGBM model
initial_model = lgb.train({}, lgb.Dataset(X_train, label=y_train), 100)

# Make predictions and calculate F1 score
y_pred_initial = initial_model.predict(X_valid)
f1_initial = f1_score(y_valid, (y_pred_initial > 0.5).astype(int))

# Display F1 Score and Confusion Matrix
print(f"Initial F1 Score: {f1_initial}")
print("Confusion Matrix:")
print(confusion_matrix(y_valid, (y_pred_initial > 0.5).astype(int)))

When I print y_pred_initial, I get an array that is supposed to contain probabilities but has negative values and values greater than 1. Example:


array([ 0.01546079,  0.17557856,  0.22971758,  1.23292351,  0.60531331,
        1.04524314,  0.7637124 ,  0.0458202 ,  0.63044718,  1.02387605,
        0.6441506 ,  0.15202829,  0.06836975,  0.12113314,  0.19732339,
        0.78233429,  0.37779053,  0.75745862,  0.29348834, -0.08458378,
       -0.07173513,  0.73006681,  0.38585976,  0.09324021, -0.02912595,
       -0.10779946,  0.22953974,  0.24480956])

I want to know why the predicted values here are not between 0 and 1, please.


1 Answer


The values you are seeing are not probabilities at all: with an empty params dict, lgb.train falls back to its default objective='regression', so the model is trained as a regressor and predict returns raw, unbounded regression scores, which is why they can be negative or exceed 1.

For classification problems, I would suggest using the LGBMClassifier wrapper as a more user-friendly alternative to lightgbm.train; its predict_proba method returns proper class probabilities and avoids this issue:

# Import the scikit-learn-style wrapper
from lightgbm import LGBMClassifier

# Initialize the LGBMClassifier
lgbm_classifier = LGBMClassifier()

# Fit the model
lgbm_classifier.fit(X_train, y_train)

# Make predictions; predict_proba returns an (n_samples, 2) array
# where column 1 is the probability of the positive class
y_pred = lgbm_classifier.predict_proba(X_valid)

# Calculate F1 score
f1 = f1_score(y_valid, y_pred[:, 1] > 0.5)
print(f"F1 Score: {f1}")
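As a side note, lgbm_classifier.predict(X_valid) returns hard class labels directly (for binary classification this amounts to thresholding the probability at 0.5), so f1_score(y_valid, lgbm_classifier.predict(X_valid)) is an equivalent shortcut if you don't need the probabilities themselves.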

You can read more about this here.
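If you prefer to stay with the low-level lightgbm.train API, the root-cause fix is to declare the objective explicitly. Here is a minimal sketch, reusing X_train, X_valid, y_train and y_valid from the question (the names binary_model and y_pred_proba are just for illustration):

from sklearn.metrics import f1_score
import lightgbm as lgb

# An empty params dict means LightGBM trains with its default
# objective 'regression', whose raw outputs are unbounded. Declaring
# the binary objective makes predict return probabilities in [0, 1].
params = {'objective': 'binary'}
binary_model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=100)

y_pred_proba = binary_model.predict(X_valid)  # now bounded between 0 and 1
f1_binary = f1_score(y_valid, (y_pred_proba > 0.5).astype(int))
print(f"F1 Score (binary objective): {f1_binary}")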
