Questions tagged [lstm]

LSTM stands for Long Short-Term Memory. Most of the time, the term refers either to a type of recurrent neural network or to a block (layer) within a larger network.

LSTM (Long Short-Term Memory)

LSTM is a specialized type of Recurrent Neural Network (RNN) architecture designed to address the vanishing gradient problem that affects standard RNNs. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs can learn long-term dependencies in sequential data.

Key Characteristics

  • Memory Cell: Contains a cell state that acts as a conveyor belt of information flowing through the network
  • Gating Mechanism: Uses three gates (input, forget, and output) to regulate information flow
  • Long-term Dependencies: Effectively captures relationships between elements separated by many time steps
  • Gradient Control: Special architecture prevents vanishing/exploding gradients common in vanilla RNNs

Applications

  • Natural Language Processing (text generation, machine translation)
  • Time Series Analysis and Prediction
  • Speech Recognition
  • Music Generation
  • Video Analysis
  • Anomaly Detection in sequential data

Technical Details

LSTMs process sequences through a chain of repeating modules. Each module contains:

  • Forget Gate: Decides what information to discard from cell state
  • Input Gate: Controls what new information enters the cell state
  • Output Gate: Determines what parts of the cell state are output

Their ability to selectively remember or forget information makes LSTMs particularly effective for sequential data with long-range dependencies.
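
As a worked illustration of the three gates, here is a minimal NumPy sketch of a single LSTM time step; stacking the four weight blocks into one matrix W is a common convention, not the only one.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W stacks the four gate weight blocks, shape (4*H, D+H); b has shape (4*H,).
        H = h_prev.shape[0]
        z = W @ np.concatenate([x_t, h_prev]) + b
        f = sigmoid(z[0:H])        # forget gate: what to drop from the old cell state
        i = sigmoid(z[H:2*H])      # input gate: what new information to write
        g = np.tanh(z[2*H:3*H])    # candidate values for the cell state
        o = sigmoid(z[3*H:4*H])    # output gate: what part of the cell state to emit
        c_t = f * c_prev + i * g   # the "conveyor belt": old memory kept + new memory added
        h_t = o * np.tanh(c_t)     # hidden state passed to the next step
        return h_t, c_t

    # Tiny demo with random weights (D=3 inputs, H=2 hidden units):
    rng = np.random.default_rng(0)
    D, H = 3, 2
    W, b = rng.normal(size=(4 * H, D + H)), np.zeros(4 * H)
    h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)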

Learning Resources

Foundational Papers

  • Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation, 9(8), 1735–1780.

Books

  • "Deep Learning" by Goodfellow, Bengio, and Courville (Chapter on Sequence Modeling)
  • "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron

1136 questions
184 votes · 6 answers

When to use GRU over LSTM?

The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update gates) whereas an LSTM has three gates (namely input, output and forget gates). Why do we make use of a GRU when we clearly have more control over the network…
asked by Sayali Sonawane (2,101)
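
The gate count shows up directly in the parameter count, which a quick Keras comparison makes concrete (the layer sizes below are arbitrary, chosen only for illustration):

    import tensorflow as tf

    units, features = 32, 8

    lstm = tf.keras.layers.LSTM(units)   # 3 gates + candidate -> 4 weight blocks
    gru = tf.keras.layers.GRU(units)     # 2 gates + candidate -> 3 weight blocks

    x = tf.zeros((1, 5, features))       # dummy batch just to build the layers
    lstm(x)
    gru(x)

    print("LSTM params:", lstm.count_params())  # 4 * units * (features + units + 1)
    print("GRU params:", gru.count_params())    # about 3/4 of the LSTM's count
                                                # (exact figure depends on the bias convention)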
38 votes · 1 answer

Time Series prediction using LSTMs: Importance of making time series stationary

In this link on Stationarity and differencing, it is mentioned that models like ARIMA require a stationarized time series for forecasting, as its statistical properties (mean, variance, autocorrelation, etc.) are constant over time. Since…
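
For reference, a small pandas sketch of first-order differencing, the usual way to stationarize a trending series for ARIMA-style models (the numbers are made up); for an LSTM, differencing is optional, but scaling usually still helps:

    import pandas as pd

    # Hypothetical series with a trend; first-order differencing removes it.
    s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])
    diff = s.diff().dropna()   # y_t - y_{t-1}
    print(diff.head())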
35 votes · 6 answers

Validation loss is not decreasing

I am trying to train an LSTM model. Is this model suffering from overfitting? Here is the train and validation loss graph:
asked by DukeLover (601)
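
A common guard against the pattern the question describes (training loss falling while validation loss stalls or rises) is early stopping; a hedged Keras sketch, with hypothetical model and data names left commented out:

    import tensorflow as tf

    # Stop once validation loss has not improved for 5 epochs and restore
    # the best weights seen so far.
    early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                             restore_best_weights=True)
    # history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #                     epochs=100, callbacks=[early])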
31 votes · 2 answers

How to feed an LSTM with different input array sizes?

If I'd like to write an LSTM network and feed it inputs of different array sizes, how is that possible? For example, I want to take voice messages or text messages in a different language and translate them. So the first input may be "hello" but the…
asked by user3486308 (1,310)
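
The standard answer is padding plus masking, so the recurrent layer skips the padded steps; a minimal Keras sketch with a hypothetical integer-encoded vocabulary:

    import tensorflow as tf
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Hypothetical token sequences of different lengths.
    seqs = [[4, 7, 1], [9, 2], [5, 3, 8, 6, 2]]
    x = pad_sequences(seqs, padding="post")   # zero-pad to the longest sequence

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10, output_dim=8, mask_zero=True),
        tf.keras.layers.LSTM(16),             # masked (padded) steps are skipped
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    print(model(x).shape)   # (3, 1): one prediction per variable-length sequence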
26 votes · 2 answers

What's the difference between the cell and hidden state in LSTM?

LSTM cells maintain two types of state, the cell state and the hidden state. How do the cell and hidden states differ in terms of their functionality? What information do they carry?
asked by user105907
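
A quick way to see both states in Keras is return_state=True (sizes below are arbitrary):

    import tensorflow as tf

    x = tf.random.normal((2, 5, 8))   # (batch, timesteps, features)
    out, h, c = tf.keras.layers.LSTM(16, return_state=True)(x)
    print(out.shape, h.shape, c.shape)   # (2, 16) (2, 16) (2, 16)
    # `out` equals `h` here: the hidden state h is what the layer exposes,
    # while the cell state c is the internal memory carried along the sequence.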
25 votes · 4 answers

What does the output of the model.predict function from Keras mean?

I have built an LSTM model to predict duplicate questions on the official Quora dataset. The test labels are 0 or 1; 1 indicates the question pair is a duplicate. After building the model using model.fit, I test the model using model.predict on the…
asked by Dookoto_Sea (361)
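
For a binary classifier ending in a sigmoid unit, model.predict returns one probability per sample; a tiny sketch with made-up numbers standing in for the model output:

    import numpy as np

    # Hypothetical stand-in for model.predict(X_test): one probability of
    # "duplicate" per question pair.
    probs = np.array([[0.91], [0.07], [0.55]])
    labels = (probs > 0.5).astype(int)   # threshold probabilities into 0/1 labels
    print(labels.ravel())                # [1 0 1]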
24 votes · 2 answers

Sliding window leads to overfitting in LSTM?

Will I overfit my LSTM if I train it via the sliding-window approach? Why do people not seem to use it for LSTMs? For a simplified example, assume that we have to predict the sequence of characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y…
asked by Kari (2,756)
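
For concreteness, a NumPy sketch of how such overlapping windows are typically built (make_windows is a hypothetical helper):

    import numpy as np

    def make_windows(seq, window):
        # Turn a sequence into (input window, next element) training pairs.
        X = np.stack([seq[i:i + window] for i in range(len(seq) - window)])
        y = seq[window:]
        return X, y

    seq = np.arange(26)        # stand-in for the A..Z character sequence
    X, y = make_windows(seq, 5)
    print(X.shape, y.shape)    # (21, 5) (21,)
    # Overlapping windows show the network the same transitions many times,
    # which can encourage memorization if windows are short relative to the patterns.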
22 votes · 2 answers

What is the job of "RepeatVector" and "TimeDistributed"?

I read about them in the Keras documentation and on other websites, but I couldn't understand exactly what they do or how we should use them when designing many-to-many or encoder-decoder LSTM networks. I saw them used in the solution of this…
asked by user3486308 (1,310)
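
A minimal encoder-decoder sketch showing where the two layers usually sit (sizes are arbitrary): RepeatVector bridges a single encoder vector to a sequence, and TimeDistributed applies the same Dense layer at every output step.

    import tensorflow as tf

    steps_in, steps_out, feats = 6, 4, 3

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(steps_in, feats)),        # encoder: sequence -> one vector
        tf.keras.layers.RepeatVector(steps_out),                        # copy that vector steps_out times
        tf.keras.layers.LSTM(32, return_sequences=True),                # decoder: one state per output step
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(feats)),  # same Dense applied to every step
    ])
    model.summary()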
22 votes · 1 answer

Understanding timesteps and batch size of a Keras LSTM, considering hidden states and TBPTT

What I am trying to do is predict the next data point $x_t$ for each point in the time series $[x_0, x_1, x_2, \dots, x_T]$, in the context of a real-time data stream; in theory the series is infinite. If a new value $x$ is…
asked by KenMarsu
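
One common tf.keras pattern for such a stream is a stateful LSTM, sketched below under the assumption of a fixed batch size; consecutive batches must then contain consecutive chunks of the same streams:

    import tensorflow as tf

    # Hidden and cell states carry over between batches instead of being
    # reset, which is how truncated BPTT over a long stream is usually set up.
    batch, timesteps, feats = 4, 10, 1
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, stateful=True,
                             batch_input_shape=(batch, timesteps, feats)),
        tf.keras.layers.Dense(1),
    ])
    model.reset_states()   # call this between independent passes over the data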
21 votes · 3 answers

What are LSTM and BiLSTM, and when to use them?

I am very new to deep learning and I am particularly interested in knowing what LSTM and BiLSTM are and when to use them (major application areas). Why are LSTM and BiLSTM more popular than plain RNNs? Can we use these deep learning architectures in…
asked by Volka (731)
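
A one-line illustration of the BiLSTM idea in Keras (sizes arbitrary): the forward and backward passes over the sequence are concatenated, doubling the output width.

    import tensorflow as tf

    layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16))
    x = tf.random.normal((2, 7, 5))   # (batch, timesteps, features)
    print(layer(x).shape)             # (2, 32): forward 16 + backward 16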
19 votes · 3 answers

Advantages of stacking LSTMs?

I'm wondering: in what situations is it advantageous to stack LSTMs?
asked by Vadim Smolyakov (656)
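
Mechanically, stacking in Keras just requires return_sequences=True on every LSTM except the last; a minimal sketch with arbitrary sizes:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(20, 8)),
        tf.keras.layers.LSTM(32),   # last layer returns only its final state
        tf.keras.layers.Dense(1),
    ])
    model.summary()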
17 votes · 1 answer

Multi-dimensional and multivariate time-series forecasting (RNN/LSTM) in Keras

I have been trying to understand how to represent and shape data to make a multidimensional and multivariate time-series forecast using Keras (or TensorFlow), but I am still very unclear after reading many blog posts/tutorials/documentation about how…
asked by Bastien (273)
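
The shape Keras recurrent layers expect is (samples, timesteps, features); a NumPy sketch with hypothetical sizes (1000 observations of 5 variables, 24-step windows, one-step-ahead targets):

    import numpy as np

    data = np.random.rand(1000, 5)
    window = 24
    X = np.stack([data[i:i + window] for i in range(len(data) - window)])
    y = data[window:]                 # predict all 5 variables one step ahead
    print(X.shape, y.shape)           # (976, 24, 5) (976, 5)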
17 votes · 5 answers

Prediction interval around LSTM time series forecast

Is there a method to calculate the prediction interval (probability distribution) around a time series forecast from an LSTM (or other recurrent) neural network? Say, for example, I am predicting 10 samples into the future (t+1 to t+10), based on…
asked by 4Oh4 (308)
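
One common approximation, though by no means the only method, is Monte Carlo dropout: keep dropout active at inference and read an interval off the spread of repeated forecasts. A sketch, assuming a hypothetical model that contains dropout layers:

    import numpy as np

    def mc_interval(model, x, n=100):
        # Calling with training=True keeps dropout on, so each forward
        # pass gives a different stochastic forecast.
        preds = np.stack([model(x, training=True).numpy() for _ in range(n)])
        lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
        return preds.mean(axis=0), lo, hi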
16 votes · 2 answers

Dropout on which layers of LSTM?

Using a multi-layer LSTM with dropout, is it advisable to put dropout on all hidden layers as well as the output Dense layers? In Hinton's paper (which proposed Dropout) he only put Dropout on the Dense layers, but that was because the hidden inner…
asked by BigBadMe (760)
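
In Keras the two placements are separate arguments on each recurrent layer: dropout for the inputs and recurrent_dropout for the hidden-to-hidden connections (the RNN-specific variant due to Gal and Ghahramani). A sketch with arbitrary sizes:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True,
                             dropout=0.2, recurrent_dropout=0.2,
                             input_shape=(30, 10)),
        tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dense(1),
    ])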
14 votes · 2 answers

How to implement "one-to-many" and "many-to-many" sequence prediction in Keras?

I struggle to interpret the difference in Keras code between one-to-many (e.g. classification of single images) and many-to-many (e.g. classification of image sequences) sequence labeling. I frequently see two different kinds of code: Type 1 is where…
asked by Hendrik (8,767)
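
The switch between the two output shapes is return_sequences; a minimal sketch with arbitrary sizes:

    import tensorflow as tf

    x = tf.random.normal((2, 10, 8))   # (batch, timesteps, features)

    # many-to-one: only the final hidden state comes out
    print(tf.keras.layers.LSTM(16)(x).shape)                         # (2, 16)

    # many-to-many: return_sequences=True emits one output per time step
    print(tf.keras.layers.LSTM(16, return_sequences=True)(x).shape)  # (2, 10, 16)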