I have a sequence-to-sequence neural network, built with Keras, to solve a time series forecasting problem; it is currently trained on samples with ten features each. The model's performance is average, and I would like to investigate whether adding or removing features will improve it.

The features I have included are:

  1. The historical data
  2. Quarterly lagged series of the historical data (4 series)
  3. A series of the week-over-week change in value
  4. Four time-invariant features tiled to extend the length of the series (another 4 series)
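
For concreteness, here is a minimal sketch of how these ten series might be stacked into a single input array. Every name and value below is illustrative (it assumes weekly data and 13-week quarters), not part of my actual pipeline:

```python
import numpy as np

T = 520                                  # e.g. ten years of weekly data
history = np.random.rand(T)              # 1. the historical series

# 2. quarterly lags (13 weeks per quarter); np.roll wraps around, so a real
#    pipeline would trim the first 13*k rows instead of keeping them
lags = [np.roll(history, 13 * k) for k in (1, 2, 3, 4)]

# 3. week-over-week change
delta = np.diff(history, prepend=history[0])

# 4. four time-invariant scalars tiled along the time axis
statics = [np.full(T, v) for v in (0.3, 1.7, 42.0, 5.0)]

X = np.column_stack([history, *lags, delta, *statics])   # shape (520, 10)
```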

I am aware that I could run the model many times, changing the combination of features included each time. However, along with tuning the hyperparameters (it might be that 8 features work really well with one set of hyperparameters but not with another), this adds up to a very large number of possible combinations.

Is there a separate way I can use to gauge whether a feature is likely to add value to the model or not?

I am particularly concerned that I am feeding four time-invariant features into a model designed to work with time-varying data, and I would like a way to measure their impact and see whether they add anything.

Aesir

3 Answers


Don't remove a feature to find out its importance, but instead randomize or shuffle it.

Run the training 10 times, randomizing a different feature column each time, and then compare the performance. There is no need to re-tune the hyperparameters when it is done this way.

Here's the theory behind my suggestion: feature importance
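
A minimal sketch of this retrain-with-one-shuffled-column loop. It assumes a function `build_model()` that returns a freshly compiled Keras model, and arrays `X_train`/`X_val` of shape `(samples, timesteps, n_features)`; all of these names are placeholders, not part of the answer itself:

```python
import numpy as np

rng = np.random.default_rng(0)
val_losses = {}

for f in range(X_train.shape[-1]):
    X_shuf = X_train.copy()
    # permute feature f across samples so its information is destroyed
    X_shuf[:, :, f] = X_shuf[rng.permutation(len(X_shuf)), :, f]
    hist = build_model().fit(
        X_shuf, y_train,
        validation_data=(X_val, y_val),
        epochs=50, verbose=0,
    )
    val_losses[f] = hist.history["val_loss"][-1]

# the features whose shuffling hurt validation loss the most matter most
print(sorted(val_losses.items(), key=lambda kv: kv[1], reverse=True))
```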

scholle

Linking to the same paper as @scholle but explaining the process differently (book and paper).

  1. You do not need to train the model multiple times. The algorithm described in the links above requires a trained model to begin with.
  2. Given a trained model, compute the metric of interest on some dataset (the book discusses pros/cons of using training set vs test set).
  3. For each feature, using that same dataset, shuffle the values of the feature in question. All other features and the labels should remain unchanged for each observation.
  4. Perform inference on the model with this shuffled dataset (one shuffled feature at a time), and compute the desired metric for each pass.
  5. Now compute the difference between the original metric (on the unchanged dataset) and the metric obtained for each feature pass (the book also mentions dividing the permuted score by the original score).

Voila! The list of feature importances is the output of step 5, sorted in descending order: a higher value means the feature is more important to the model in question.
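
A minimal sketch of steps 2-5 for a trained Keras model, assuming inputs of shape `(samples, timesteps, n_features)` and MSE as the metric of interest; `model`, `X`, and `y` are placeholders assumed to already exist:

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: metric on the unchanged dataset
    base = np.mean((model.predict(X, verbose=0) - y) ** 2)
    importances = np.zeros(X.shape[-1])
    for f in range(X.shape[-1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # step 3: shuffle only feature f, leave everything else unchanged
            X_perm[..., f] = X_perm[rng.permutation(len(X)), :, f]
            # step 4: inference on the shuffled copy
            score = np.mean((model.predict(X_perm, verbose=0) - y) ** 2)
            # step 5: difference from the original metric
            drops.append(score - base)
        importances[f] = np.mean(drops)
    return importances   # sort descending: higher = more important

# importances = permutation_importance(model, X_val, y_val)
# ranking = np.argsort(importances)[::-1]
```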

Edit: should I use the training set or the test/dev set for permutation feature importance?

The book linked above addresses this question. A more concise answer can be found in scikit-learn's docs:

Permutation importances can be computed either on the training set or on a held-out testing or validation set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit.
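
For reference, scikit-learn ships this procedure as `sklearn.inspection.permutation_importance`. It expects an sklearn-style fitted estimator and 2-D inputs, so this runnable sketch uses a stand-in regressor and synthetic data rather than the Keras model from the question:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

est = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
# computed on the held-out set, per the advice quoted above
result = permutation_importance(est, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)   # one mean importance per feature
```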


You can do this sort of thing using SHAP; it looks at permutation importance as well.
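
A hedged sketch of how that might look with SHAP's permutation explainer; here `model` is assumed to be a fitted predictor over 2-D tabular inputs and `X` a NumPy array, neither of which comes from this answer:

```python
import shap  # pip install shap

# wrap the model's predict function and use the data itself as the masker
explainer = shap.explainers.Permutation(model.predict, X)
shap_values = explainer(X[:100])   # explain the first 100 rows
shap.plots.bar(shap_values)        # global feature-importance summary
```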