
In standard machine learning settings with cross-sectional data, it's common to assume that data points are independently and identically distributed (i.i.d.) from some fixed data-generating process (DGP) $D$: $$ (x_i, y_i) \sim D, \quad \text{i.i.d. for } i = 1, \dots, N. $$
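For concreteness, here is a minimal sketch of the cross-sectional setting (the Gaussian DGP and the linear relationship between $x$ and $y$ are purely illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000  # number of samples

# Illustrative DGP D (hypothetical): x ~ N(0, 1), y = 2x + Gaussian noise.
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)

# Each (x_i, y_i) is an independent draw from the same D, so the rows of the
# dataset carry no dependence on one another.
data = np.column_stack([x, y])
```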
This assumption underpins many theoretical results, including convergence guarantees for empirical risk minimization and consistency of learned parameters.

Now consider a different structure:
Each training sample is itself a sequence: $$ x^{(i)} = (x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_T), $$ and the sequences are sampled as: $$ x^{(i)} \sim D, \quad \text{i.i.d. for } i = 1, \dots, N, $$ i.e., each entire sequence $x^{(i)}$ is drawn independently from the same data-generating process $D$. However, the elements within a given sequence may exhibit internal dependencies — for example, temporal correlation, autoregressive structure, or long-range dependencies.
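To make this structure concrete, here is a minimal sketch in which $D$ generates AR(1) sequences (the AR(1) form, coefficient, and dimensions are illustrative assumptions): the $N$ sequences are mutually independent, while the elements within each sequence are serially correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 50   # number of sequences, length of each sequence
phi = 0.8        # AR(1) coefficient (illustrative)

# Draw N independent sequences from the same DGP D.
# Within a sequence, x_t depends on x_{t-1}; across sequences, draws are i.i.d.
X = np.zeros((N, T))
for i in range(N):
    for t in range(1, T):
        X[i, t] = phi * X[i, t - 1] + rng.normal()

# Sanity check: strong lag-1 correlation within a sequence,
# no systematic correlation between two different sequences.
within = np.corrcoef(X[0, :-1], X[0, 1:])[0, 1]   # roughly phi
across = np.corrcoef(X[0], X[1])[0, 1]            # roughly 0, up to sampling noise
print(within, across)
```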

Question:

In the context of modern sequence models — especially RNN-based architectures such as LSTMs and seq2seq models — can the classical i.i.d. assumption across samples be meaningfully extended to this setting, where each training sample is an internally dependent sequence generated independently from a fixed DGP?

This structure is common in practice — for instance, in machine translation, time series forecasting, or sequential decision-making — where models are trained on a collection of sequences and are expected to capture complex internal dynamics within each.

A related discussion here mentions that models like LSTMs can learn long-term, nonlinear dependencies and may even adapt to non-stationarity. However, that discussion is more empirical, and I'm interested in understanding whether the classical theoretical frameworks (e.g., learning theory based on i.i.d. assumptions) apply or extend naturally in this setting.

Any theoretical insights, intuition, or references would be appreciated.
