I tried to detect outliers in the gas consumption of some Dutch buildings by building a neural network model. The results are very bad, but I can't find the reason.

I am not an expert so I would like to ask you what I can improve and what I'm doing wrong. This is the complete description: https://github.com/denadai2/Gas-consumption-outliers.

The neural network is a feedforward network trained with backpropagation. As described here, I split the data into a "small" dataset of 41,000 rows and 9 features, and I also tried adding more features.

I trained the networks, but the resulting RMSE is 14.14, so the model can't predict gas consumption well, and consequently I can't build a good outlier detection mechanism on top of it. I see that some papers, even when predicting daily or hourly electric power consumption, report errors like MSE = 0.01.

What can I improve? What am I doing wrong? Could you have a look at my description?

VividD
marcodena

4 Answers

Just an idea: your data is highly seasonal, with clearly visible daily and weekly cycles. So first of all, try to decompose your variables (gas and electricity consumption, temperature, and solar radiation). Here is a nice tutorial on time series decomposition for R.

After obtaining the trend and seasonal components, the most interesting part begins. It's just an assumption, but I think the gas and electricity consumption variables would be quite predictable by means of time series analysis (e.g., an ARIMA model). From my point of view, the most exciting part here is trying to predict the residuals left after decomposition, using the available data (temperature anomalies, solar radiation, wind speed). I suppose these residuals would be the outliers you are looking for. I hope you find this useful.
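
To make the idea concrete, here is a minimal sketch in Python, using only the standard library. It is a simplification and not a full decomposition: it assumes a single daily cycle and no trend, removes the average value per hour of day, and flags residuals that are far from the median. The synthetic series, the function names and the threshold of 5 robust deviations are all assumptions for illustration, not taken from the notebooks.

```python
import math
import random
import statistics

def seasonal_profile(values, period):
    """Average value at each position within the cycle (e.g. hour of day)."""
    buckets = [[] for _ in range(period)]
    for i, v in enumerate(values):
        buckets[i % period].append(v)
    return [statistics.mean(b) for b in buckets]

def residuals_after_seasonality(values, period):
    """Subtract the periodic profile, leaving what the cycle cannot explain."""
    profile = seasonal_profile(values, period)
    return [v - profile[i % period] for i, v in enumerate(values)]

def flag_outliers(residuals, threshold=5.0):
    """Indices whose residual is more than `threshold` robust deviations from the median."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [i for i, r in enumerate(residuals) if abs(r - med) > threshold * scale]

# Synthetic hourly series (assumed data): a daily cycle plus noise, 28 days long.
rng = random.Random(0)
PERIOD = 24
series = [50 + 20 * math.sin(2 * math.pi * (h % PERIOD) / PERIOD) + rng.gauss(0, 1)
          for h in range(PERIOD * 28)]
series[300] += 30  # one injected anomaly

resid = residuals_after_seasonality(series, PERIOD)
outliers = flag_outliers(resid)
print(outliers)
```

For a weekly cycle on hourly data the period would be 168 instead of 24. A proper decomposition would also estimate a trend component before looking at the residuals.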

sobach

In your training notebook you present results for training with 20 epochs. Have you tried varying that parameter, to see if it affects your performance? This is an important parameter for back-propagation.

For estimating your model parameters, as user tomaskazemekas pointed out, plotting Learning Curves is a very good approach. In addition to that, you could also create a plot using a model parameter (e.g. training epochs or hidden layer size) vs. Training and Validation error. This will allow you to understand the bias/variance tradeoff, and help you pick a good value for your parameters. Some info can be found here. Naturally, it is a good idea to keep a small percentage of your data for a (third) Test set.
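
As a rough illustration of the "parameter vs. training and validation error" plot, here is a small Python sketch that records both errors after every epoch. To stay self-contained it trains a one-feature linear model with stochastic gradient descent as a stand-in for the network; the data, the learning rate and the function names are assumptions for illustration only.

```python
import math
import random

def rmse(pairs, w, b):
    """Root mean squared error of the line y = w*x + b on (x, y) pairs."""
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs))

def train_with_history(train, valid, epochs, lr=0.005, seed=0):
    """Fit y ~ w*x + b by SGD, logging train and validation RMSE after each epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(train)
    history = []
    for epoch in range(1, epochs + 1):
        rng.shuffle(data)
        for x, y in data:
            err = w * x + b - y   # gradient of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
        history.append((epoch, rmse(train, w, b), rmse(valid, w, b)))
    return history

# Synthetic data (assumed): a scaled feature in [-1, 1] and a noisy linear target.
rng = random.Random(1)
points = [(x, 3.0 - 2.0 * x + rng.gauss(0, 0.3))
          for x in (rng.uniform(-1, 1) for _ in range(300))]
train, valid = points[:200], points[200:]

history = train_with_history(train, valid, epochs=20)
for epoch, train_err, valid_err in history:
    print(f"{epoch:2d}  train={train_err:.3f}  valid={valid_err:.3f}")
```

Plotting the two columns against the epoch number gives the curve: if both errors are still falling at the last epoch, training longer may help; if the validation error starts rising while the training error keeps falling, the model is overfitting.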

As a side note, it seems that increasing the number of neurons in your model shows no significant improvement in your RMSE. This suggests that you could also try a simpler model, i.e. one with fewer neurons, and see how it behaves.

In fact, I would suggest (if you haven't done so already) trying a simple model with few or no hyperparameters first, e.g. linear regression, and comparing your results with the literature, just as a sanity check.
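
A sanity check of this kind could look like the following Python sketch. It fits a one-feature least-squares line and compares its test RMSE with the trivial "always predict the training mean" baseline. The temperature and gas values are synthetic and the names are assumptions; the point is only the comparison, which also shows that an RMSE value is meaningful only relative to the scale of the target.

```python
import math
import random

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y ~ slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Synthetic data (assumed): colder weather means more gas used, plus noise.
rng = random.Random(42)
temps = [rng.uniform(-5.0, 25.0) for _ in range(500)]
gas = [60.0 - 2.0 * t + rng.gauss(0.0, 4.0) for t in temps]

train_x, test_x = temps[:400], temps[400:]
train_y, test_y = gas[:400], gas[400:]

slope, intercept = fit_line(train_x, train_y)
linear_rmse = rmse([slope * x + intercept for x in test_x], test_y)

# Trivial baseline: always predict the mean of the training targets.
mean_y = sum(train_y) / len(train_y)
mean_rmse = rmse([mean_y] * len(test_y), test_y)

print(f"mean baseline RMSE: {mean_rmse:.2f}")
print(f"linear model RMSE:  {linear_rmse:.2f}")
```

If the neural network cannot beat the linear model on the same split, the problem is more likely in the features or the preprocessing than in the network size.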

insys

The main problem here is that even before attempting to apply anomaly detection algorithms, you are not getting good enough predictions of gas consumption using neural networks.

If the main goal is to reach the stage where anomaly detection algorithms can be applied, and you have access to examples of linear regression being used successfully for this problem, that approach could be more productive. One of the principles of applying machine learning successfully is to try out several different algorithms before making a final selection based on the results.

If you choose to tune your neural network's performance, you can plot learning curves that show the effect of changing different hyperparameters on the error rate. Hyperparameters that can be modified include:

  • number of features
  • order of the polynomial
  • regularization parameter
  • number of layers in the network

The best settings can be selected by their performance on the cross-validation set.
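
As a sketch of how such a selection could work, the Python snippet below picks a regularization parameter by 5-fold cross-validation. It uses a one-feature ridge regression in place of the network so that it stays short; the data, the candidate values and the function names are assumptions for illustration.

```python
import math
import random

def fit_ridge(train, lam):
    """One-feature ridge regression; the penalty `lam` shrinks the slope toward zero."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / (sxx + lam)
    return slope, mean_y - slope * mean_x

def cv_rmse(data, lam, k=5):
    """Mean validation RMSE over k folds (data is assumed to be shuffled already)."""
    scores = []
    for fold in range(k):
        valid = data[fold::k]  # every k-th point, starting at `fold`
        train = [p for i, p in enumerate(data) if i % k != fold]
        slope, intercept = fit_ridge(train, lam)
        mse = sum((slope * x + intercept - y) ** 2 for x, y in valid) / len(valid)
        scores.append(math.sqrt(mse))
    return sum(scores) / k

# Synthetic data (assumed): a noisy linear relation, generated in random order.
rng = random.Random(7)
data = [(x, 1.5 * x + rng.gauss(0, 1.0))
        for x in (rng.uniform(-3, 3) for _ in range(60))]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_rmse(data, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam in candidates:
    print(f"lambda={lam:<7} cv_rmse={scores[lam]:.3f}")
print("selected:", best_lam)
```

The same loop works for any other hyperparameter on the list: replace `fit_ridge` with the model being tuned and iterate over the candidate settings.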

tomaskazemekas

I did not see your neural network model in your notebooks. Can you point out which library you are using, how many layers you have, and what type of neural network it is?

It also seems from your notebooks that you are training the neural network on the noisy dataset that still contains the outliers. I think you should train it on a dataset without outliers, so that you can then use the distance between each observation and the network's prediction to label that observation as an outlier or not.
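
A minimal Python sketch of that labelling step, under stated assumptions: `predict` is a hypothetical stand-in for the trained network, the clean data is synthetic, and the threshold of 3 standard deviations of the error on clean data is just one common choice.

```python
import random
import statistics

def calibrate_threshold(predict, clean_points, k=3.0):
    """Threshold = k standard deviations of the prediction error on clean data."""
    errors = [y - predict(x) for x, y in clean_points]
    return k * statistics.pstdev(errors)

def label_outliers(predict, points, threshold):
    """True where the observation lies further from the prediction than the threshold."""
    return [abs(y - predict(x)) > threshold for x, y in points]

def predict(temp_c):
    # Hypothetical stand-in for a model trained on outlier-free data.
    return 60.0 - 2.0 * temp_c

# Synthetic clean data (assumed): model output plus ordinary measurement noise.
rng = random.Random(3)
clean = [(t, predict(t) + rng.gauss(0, 3.0))
         for t in (rng.uniform(-5, 25) for _ in range(400))]

threshold = calibrate_threshold(predict, clean)

# New (temperature, gas) observations to label; the second is far off the prediction.
new_points = [(10.0, 41.0), (10.0, 75.0), (0.0, 58.5)]
labels = label_outliers(predict, new_points, threshold)
print(labels)
```

If the model is trained on data that still contains outliers, both the predictions and the calibrated threshold get pulled toward them, which makes the outliers harder to separate.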

I wrote a couple of things on outlier detection in time-series signals. Your data is highly seasonal, as sobach mentioned, and you could use an FFT (first link above) to get the overall trend in the signal. Once you have the frequency components of the gas consumption, you can look at the high-frequency components to find the outliers.
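
Here is a small Python sketch of the frequency-based idea, not the exact method from the linked posts. It keeps only the lowest frequencies of the signal as the smooth part and treats whatever is left as the high-frequency residual. A naive DFT is written out so that the example needs no extra libraries; in practice an FFT routine such as `numpy.fft` would be used. The signal, the cutoff of 10 bins and the function names are assumptions.

```python
import cmath
import math
import random

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Inverse transform; returns the real part, since the input signal was real."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def low_pass(signal, keep):
    """Zero every frequency bin above `keep` (and its mirrored bin)."""
    n = len(signal)
    spectrum = dft(signal)
    filtered = [c if min(k, n - k) <= keep else 0j for k, c in enumerate(spectrum)]
    return inverse_dft(filtered)

# Synthetic hourly signal (assumed): 8 days with a daily cycle, noise and one spike.
rng = random.Random(5)
n = 24 * 8
signal = [50 + 20 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, 1) for t in range(n)]
signal[100] += 25  # injected anomaly

smooth = low_pass(signal, keep=10)  # the daily cycle sits in bin 8, so it is kept
residual = [s - m for s, m in zip(signal, smooth)]
spike_index = max(range(n), key=lambda i: abs(residual[i]))
print(spike_index)
```

The cutoff has to sit above the highest seasonal frequency you want to keep; otherwise the daily cycle itself ends up in the residual and looks like an anomaly.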

Also, if you want to stick with a neural network for seasonal data, you may want to look at recurrent neural networks. They can incorporate past observations better than a vanilla feedforward network, and may therefore give better results on the data that you have.

Bugra