My question is very similar to this SO post: How to embed Sequence of Sentences in RNN?
Using this code snippet as an example (for 2 sequences with each timestep observation containing 3 numerical features):
import numpy as np
np.random.seed(1)
num_seq = 2
num_features = 3
MAX_KNOWN_RESPONSE_VALUE = 120
lengths = np.random.randint(low=30, high=30000, size=num_seq)
# lengths = array([29763, 265])
X_batch = list(map(lambda n: np.random.rand(n, num_features), lengths))
# X_batch[0].shape = (29763, 3)
# X_batch[1].shape = (265, 3)
y_batch = MAX_KNOWN_RESPONSE_VALUE * np.random.rand(num_seq)
# y_batch = array([35.51784086, 96.78678551])
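(For reference, I plan to wrap these arrays in a torch Dataset roughly as sketched below; the class name SequenceDataset and the float32 casts are just my own choices, not something from the linked post.)
import torch
from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    def __init__(self, X_list, y):
        # X_list: list of (seq_len, num_features) arrays; y: (num_seq,) array of responses
        self.X_list = X_list
        self.y = y

    def __len__(self):
        return len(self.X_list)

    def __getitem__(self, idx):
        x = torch.as_tensor(self.X_list[idx], dtype=torch.float32)
        y = torch.as_tensor(self.y[idx], dtype=torch.float32)
        return x, y

# dataset = SequenceDataset(X_batch, y_batch)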
The only differences (compared to the linked StackOverflow post) are:
1. The sequences are variable length, ranging from 30 to 30000 timestep observations. In the above example with 2 sequences, the first sequence has length 29763 and the second has length 265.
2. Each timestep observation per sequence contains a set of numerical features that are not sequentially defined. In the above example, each timestep observation has 3 numerical features without any associated ordering.
3. The response value per sequence is numerical, not categorical.
MAX_KNOWN_RESPONSE_VALUE is the observed maximum response value in the training set, but during inference the model should theoretically be able to predict any nonnegative value (no negative response values are allowed, but predictions above MAX_KNOWN_RESPONSE_VALUE are permitted).
Some initial thoughts on tweaking the solution proposed in the linked StackOverflow post to fit my problem:
Regarding (1), in order for the RNN to learn properly while minimizing the amount of zero-padding needed, I am thinking of defining a BySequenceLengthSampler and passing it to the sampler parameter of my DataLoader, so that sequences of similar length end up in the same batch. I think it is also worth looking into packing my padded sequences, but I'm not entirely sure whether that is specifically useful for my problem.
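To make that concrete, here is a minimal sketch of the collate step I have in mind (this is not the actual BySequenceLengthSampler, just the per-batch padding and length bookkeeping, assuming the SequenceDataset sketched above):
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch: list of (x, y) pairs where x has shape (seq_len, num_features)
    xs, ys = zip(*batch)
    lengths = torch.tensor([x.shape[0] for x in xs])
    # pad every sequence up to the longest one in this batch
    x_padded = pad_sequence(list(xs), batch_first=True)  # (batch_size, max_seq_len, num_features)
    return x_padded, lengths, torch.stack(ys)

# loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn, sampler=...)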
Regarding (2), I think there are multiple options here. My initial thoughts are inspired by this solution, but instead I'd remove the embedding step and start off with a Linear layer (rather than the LSTM layer identified in the solution) to translate from (batch_size, max_seq_len, num_features) to (batch_size, max_seq_len, hidden_size), which would then go through an RNN cell, resulting in (batch_size, max_seq_len, hidden_size_2). I could also directly pass (batch_size, max_seq_len, num_features) through an RNN cell to get (batch_size, max_seq_len, hidden_size).
Regarding (3), instead of using the torch.sigmoid function, which squashes the output into (0, 1) (clearly more suited to classification problems), I would either add a ReLU nonlinearity after the last Linear layer or add no nonlinearity at all and let the Linear layer make the prediction. I may also need to transform my response values so that the weights don't explode for very large response values, but I'm not sure whether that's needed. However, I should definitely standardize my input features to mean 0 and standard deviation 1.
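Putting (2) and (3) together, the kind of model I'm imagining looks roughly like the sketch below (layer sizes are placeholders, the packed-sequence handling assumes the lengths tensor from the collate_fn above, and I'm not sure this is the right design, which is part of my question):
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SeqRegressor(nn.Module):
    def __init__(self, num_features=3, hidden_size=64):
        super().__init__()
        self.input_proj = nn.Linear(num_features, hidden_size)  # replaces the embedding step
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)                   # raw regression output, no sigmoid

    def forward(self, x_padded, lengths):
        # x_padded: (batch_size, max_seq_len, num_features)
        h = self.input_proj(x_padded)                            # (batch_size, max_seq_len, hidden_size)
        packed = pack_padded_sequence(h, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.rnn(packed)                           # h_n: (1, batch_size, hidden_size)
        out = self.head(h_n[-1]).squeeze(-1)                     # (batch_size,)
        return torch.relu(out)                                   # keeps predictions nonnegative

I'd train this with nn.MSELoss against (possibly scaled) targets; whether the final ReLU helps or whether it's better to leave the output unconstrained is exactly the kind of thing I'm unsure about.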
Am I approaching this problem correctly with the thoughts I had?
Could you provide me with a reproducible code solution to model this, along with some detailed explanation?