
I have an OHLCV dataset that starts on 01-01-2000 and ends on 31-12-2003 and I want to evaluate a model, say an SVM regressor.

What is the correct routine to evaluate the performance of the model from 01-01-2003?

These are the steps I performed:

For testday in [01-01-2003, 31-12-2003]:
1. X_train, y_train = data[01-01-2000, testday-1]
2. X_test, y_test = data[testday]
3. scaler_X = MinMaxScaler().fit(X_train)
4. X_scaled_train = scaler_X.transform(X_train)
5. X_scaled_test = scaler_X.transform(X_test)
6. model = svm.SVR()
7. model.fit(X_scaled_train, y_train)
8. res[testday]['pred'] = model.predict(X_scaled_test)[0]
9. res[testday]['real'] = y_test[0]
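The steps above can be sketched as a runnable loop. This is only an illustration of the expanding-window routine described in the question, using synthetic data in place of the real OHLCV series (the array sizes, the feature construction, and the choice of `start` index standing in for 01-01-2003 are all assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 120                                  # stand-in for the 2000-2003 daily history
X = rng.normal(size=(n, 5))              # synthetic OHLCV-like features
y = 0.5 * X[:, 3] + rng.normal(scale=0.1, size=n)  # synthetic target

start = 100                              # first test day (stands in for 01-01-2003)
preds, reals = [], []
for t in range(start, n):
    X_train, y_train = X[:t], y[:t]      # all history up to testday-1
    X_test = X[t:t + 1]                  # the single test day
    scaler = MinMaxScaler().fit(X_train) # fit the scaler on training data only
    model = SVR().fit(scaler.transform(X_train), y_train)
    preds.append(model.predict(scaler.transform(X_test))[0])
    reals.append(y[t])
```

Note that the scaler is refit inside the loop, exactly as in the question, so no information from the test day leaks into the scaling.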

Finally, I get the accuracy with:

accuracy_score(res['real'], res['pred'])*100

Is this routine of training on an increasing number of days and testing on the next day correct?

tir

1 Answer


The type of testing you are doing is called out-of-period testing: you train on the same series of observations, but the training window ends at a different point in time for each test day.

To get a more stable model, there is an approach called multi-slicing (see this example of multi-slicing for churn prediction). You train the model on different historical snippets of your time series: create historical sub-time-series of the original series of observations (per client, model, or whatever the unit is), then concatenate all of them and do your usual training (up to your testday).
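The multi-slicing idea described above can be sketched as follows. This is only an illustration under assumed parameters: the `window` and `step` values, and the helper name `multi_slice`, are not from the original answer:

```python
import numpy as np

def multi_slice(X, y, window, step):
    """Cut the history into overlapping sub-series and stack them,
    so training sees several historical 'snapshots' of the series
    (a sketch of the multi-slicing idea; window/step are assumptions)."""
    Xs, ys = [], []
    for s in range(0, len(X) - window + 1, step):
        Xs.append(X[s:s + window])
        ys.append(y[s:s + window])
    return np.concatenate(Xs), np.concatenate(ys)

X = np.arange(20).reshape(10, 2)   # toy series: 10 days, 2 features
y = np.arange(10)
X_aug, y_aug = multi_slice(X, y, window=5, step=2)
# slices start at days 0, 2 and 4, giving 3 x 5 = 15 stacked rows
```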

NOTE: Use precision, recall, F1 and ROC AUC scores rather than accuracy; this link explains why.

Ali Massoud