I am participating in a Kaggle multiclass classification competition in which submissions are scored by log loss. I am using Keras and scikit-learn with a deep neural network, and have taken the approach below.

I corrected the class imbalance in the training data by oversampling the minority classes. I split the training data into training (X_train, y_train) and validation (X_test, y_test) sets, scaled the features, and one-hot encoded the labels.
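A minimal sketch of random oversampling, assuming plain duplication with replacement (the exact method used is not stated here). The function name `oversample` and the toy data are hypothetical. One thing the sketch makes visible: if this is applied before the train/validation split, copies of the same row can end up on both sides of the split.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, labels, seed=9):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data (hypothetical): class "a" has 4 rows, class "b" has 1.
rows = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
labels = ["a", "a", "a", "a", "b"]
balanced_rows, balanced_labels = oversample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

In the toy data every added "b" row is an exact copy of `[8, 9]`, which is why the order of oversampling and splitting matters.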

When I run the model, I get a very good validation loss (1.708) and validation accuracy compared with the Kaggle leaderboard, where the top log loss is 1.744. But when I submit my predicted class probabilities for test_set, I get an awfully high loss (4+). (Separately, a different modelling approach gave me a decent score of 2.02, which is what the leaderboard reflects.)

Why is this? Any suggestions on what should be done or where I am going wrong?

total classes:

Class_3    51811
Class_7    51811
Class_2    51811
Class_5    51811
Class_1    51811
Class_9    51811
Class_6    51811
Class_8    51811
Class_4    51811
Name: target, dtype: int64

466299

X_train, X_test, y_train, y_test = tts(X, y, test_size=.3, stratify=y, random_state=9)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(326409, 75)
(326409, 9)
(139890, 75)
(139890, 9)

display(X_train.head(3))
display(X_test.head(3))
display(y_train[:3])
display(y_test[:3])

           feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9  ...  feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
425643           0           0           0           0           0           0           0           0           0           0  ...           0           0           0           0           3           0           1           0           0           0
303754           2           3           2           2           5           0           0           1           1           1  ...           1           0           0           0           0           0           0           4           6           0
 80710           2           8           2           0          18           2           0           2           1           3  ...           0           0           4           1           0           3           0           0           1           0

3 rows × 75 columns

           feature_0   feature_1   feature_2   feature_3   feature_4   feature_5   feature_6   feature_7   feature_8   feature_9  ...  feature_65  feature_66  feature_67  feature_68  feature_69  feature_70  feature_71  feature_72  feature_73  feature_74
300226           0           0           1           4           0           0           0           4           1           1  ...           1           0           1           0           0           1           0           0           2           2
124793           0           0           0           6           0           0           0           3           7           2  ...           0           0           0           0           0           0           0           0           0           0
439437           0           3           0           0           5           0           0           2           1           1  ...           2           0           0           0           3           0           4           0           0           0

3 rows × 75 columns

array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

array([[0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

print(X_train.index.isin(X_test.index).sum())
print(X_test.index.isin(X_train.index).sum())

0
0

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
test_set = scaler.fit_transform(test_set)
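In the scaling code above, `fit_transform` is called on each of the three sets, so each one is standardized with its own mean and standard deviation. The sketch below uses plain Python rather than scikit-learn (the `fit` and `transform` helpers and the toy values are hypothetical stand-ins) to illustrate what that does to data whose distribution differs from the training data. Whether this accounts for the leaderboard gap is not established here.

```python
import statistics

def fit(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column) or 1.0  # guard against zero variance
    return mean, std

def transform(column, params):
    """Standardize a feature with previously learned statistics."""
    mean, std = params
    return [(v - mean) / std for v in column]

train = [0.0, 2.0, 4.0]     # hypothetical training feature values
test = [10.0, 12.0, 14.0]   # same spread, shifted by 10

train_params = fit(train)
train_scaled = transform(train, train_params)
shared = transform(test, train_params)  # training statistics reused
refit = transform(test, fit(test))      # statistics re-learned on the test data

print(train_scaled)
print(shared)
print(refit)
```

With the statistics re-learned, the shifted test values become indistinguishable from the scaled training values; with the training statistics reused, the shift is preserved. The usual scikit-learn pattern for the latter is `fit_transform` on the training set and `transform` on the validation and test sets.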

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(75,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=.001), metrics=['accuracy'])

monitor_val_acc = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, epochs=50, validation_split=.3, callbacks=[monitor_val_acc], batch_size=1024)
accuracy = model.evaluate(X_test, y_test)[1]
print('Accuracy:', accuracy)

............
Epoch 28/30
45/45 [==============================] - 5s 117ms/step - loss: 1.6676 - accuracy: 0.3626 - val_loss: 1.7675 - val_accuracy: 0.3333
Epoch 29/30
45/45 [==============================] - 5s 114ms/step - loss: 1.6140 - accuracy: 0.3809 - val_loss: 1.7815 - val_accuracy: 0.3357
Epoch 30/30
45/45 [==============================] - 5s 117ms/step - loss: 1.5942 - accuracy: 0.3869 - val_loss: 1.7126 - val_accuracy: 0.3563
4372/4372 [==============================] - 11s 2ms/step - loss: 1.7085 - accuracy: 0.3582
Accuracy: 0.3581957221031189

from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

preds_val = model.predict(X_test)

preds_val[:3]

array([[1.13723904e-01, 5.20741269e-02, 4.70720865e-02, 1.59640312e-02, 1.92086305e-02,
        2.25828230e-01, 1.81854114e-01, 1.99746847e-01, 1.44528091e-01],
       [6.04994688e-03, 1.40825182e-01, 9.95656699e-02, 5.96038415e-04, 5.59030111e-09,
        4.57442701e-02, 3.05081338e-01, 1.77178025e-01, 2.24959582e-01],
       [6.54266328e-02, 9.87399742e-02, 1.07230745e-01, 1.46904245e-01, 6.80148089e-03,
        1.52257413e-01, 1.22348621e-01, 1.58026025e-01, 1.42264828e-01]], dtype=float32)

log_loss(y_test, preds_val)

1.708450169537806
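For reference, multiclass log loss is the mean negative log of the probability assigned to the true class. The sketch below is a plain-Python version of that formula, not scikit-learn's implementation; the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions added for numerical safety. It gives a baseline to compare against: with 9 classes, a uniform guess scores ln 9 ≈ 2.197, so a submitted score above that is worse than predicting 1/9 for every class.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: one-hot rows; y_prob: predicted probability rows.
    """
    total = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            if t:
                # Clip so that a zero probability does not give log(0).
                total -= math.log(min(max(p, eps), 1.0))
    return total / len(y_true)

# One sample whose true class is the sixth of nine (hypothetical example).
one_hot = [[0, 0, 0, 0, 0, 1, 0, 0, 0]]
uniform = [[1.0 / 9] * 9]

baseline = multiclass_log_loss(one_hot, uniform)
print(baseline)
```

Because the loss depends only on the probability given to the true class, a mismatch between the column order of the submission and the class order used in training can by itself push the score well above this baseline.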

Srinivas