
I am going through the "Text classification with TensorFlow Hub" tutorial. In this tutorial, a total of 50,000 IMDb reviews are split into 25,000 reviews for training and 25,000 reviews for testing.

I am surprised by this way of splitting the data, since I learned in Andrew Ng's course that for fairly small datasets (<10,000 examples) the "old-fashioned" rule of thumb was to consider 60% or 70% of the data as training examples and the remainder as dev/test examples.

Is there a reason behind this 50:50 split?

  • Is it common practice when working with text?
  • Does it have anything to do with using a "pre-trained" TensorFlow Hub layer?
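
For reference, here is a minimal sketch of how the tutorial-style split can be expressed with the TensorFlow Datasets slicing API (the 60/40 carve-out of a validation set from the 25,000 training reviews is only illustrative; the exact percentages in the tutorial may differ):

```python
import tensorflow_datasets as tfds

# Load IMDB reviews: 25,000 reviews come predefined as the 'train' split
# and 25,000 as the 'test' split. The slicing syntax additionally carves
# a validation set out of the training split.
(train_data, validation_data, test_data), info = tfds.load(
    name="imdb_reviews",
    split=("train[:60%]", "train[60%:]", "test"),
    as_supervised=True,
    with_info=True,
)

print(info.splits["train"].num_examples)  # 25,000 predefined training reviews
print(info.splits["test"].num_examples)   # 25,000 predefined test reviews
```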
Sheldon

2 Answers


Is it common practice when working with text?

No. You can split the dataset however you like; for real-world problems you should generally use cross-validation.
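
For example, a minimal k-fold cross-validation sketch with scikit-learn (the data and model here are placeholders, assumed purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data: 1,000 examples with 20 features and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# 5-fold cross-validation: every example is used for validation exactly once.
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(f"mean CV accuracy: {np.mean(scores):.3f}")
```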

Does it have anything to do with using a "pre-trained" TensorFlow Hub layer?

No, it doesn't.

fuwiak

A safer method is to use $n_c = \lfloor n^{3/4} \rfloor$ examples (i.e. $n^{3/4}$ truncated to an integer) for training and $n_v \equiv n - n_c$ for validation (a.k.a. testing). If you are doing cross-validation, you could perform that whole train-test split at least $n$ times (preferably $2n$ if you can afford it), recording the average validation loss at the end of each cross-validation "fold" (replicate), which is what TensorFlow records anyway (see this answer for how to capture it). When using Monte Carlo cross-validation (MCCV), for each of the $n$ (or $2n$, if resource constraints permit) replicates you would randomly select $n_c$ examples for training (without replacement, to keep things simple) and use the remaining $n_v$ examples for validation, without even stratifying the subsets (by class, for example, if you are doing classification).
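
As a concrete illustration, here is a minimal MCCV sketch following that recipe (the data and the scikit-learn classifier are placeholders assumed only for illustration; only the split sizes follow the $n_c \approx n^{3/4}$ rule, and the validation loss is a plain log-loss rather than whatever your TensorFlow model reports):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Placeholder data: n examples with 20 features and binary labels.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 20))
y = rng.integers(0, 2, size=n)

n_c = int(n ** 0.75)   # training-set size: n^(3/4), truncated
n_v = n - n_c          # validation-set size

# Monte Carlo cross-validation: n replicates, each with a fresh random
# split (sampling without replacement, no stratification).
val_losses = []
for _ in range(n):
    perm = rng.permutation(n)
    train_idx, val_idx = perm[:n_c], perm[n_c:]
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    val_losses.append(log_loss(y[val_idx], model.predict_proba(X[val_idx])))

print(f"n_c={n_c}, n_v={n_v}, mean validation loss={np.mean(val_losses):.4f}")
```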

This is based on a 1993 paper (see my answer here for more information) by J. Shao, in which he proves that $n_c \approx n^{3/4}$ is optimal for linear model selection. At that time, non-linear models such as those now common in machine learning (see this answer for yet another discussion on that) were not as popular, and as far as I know (I would love to be proven wrong) nobody has taken the time to prove anything similar for the models in popular use today, so this is the best answer I can give you right now.

UPDATE: Knowing that GPUs work most efficiently when they are fed batches whose size is a power of two, I have calculated different ways to split data into training and validation sets that follow Jun Shao's strategy of making the training-set size $n_c \approx n^{3/4}$ while keeping both $n_c$ and $n_v \equiv n - n_c$ close to powers of two. An interesting note: for $n = 640$, $n_c \approx 127$ and therefore $n_v \approx 513$; because $127 \approx 2^7$ and $513 \approx 2^9$, I plan to use those as my training and validation set sizes whenever I am generating simulated data.
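
A small helper illustrating that calculation (hypothetical code; it simply scans for dataset sizes $n$ where both $n_c = \lfloor n^{3/4} \rfloor$ and $n_v = n - n_c$ land within 1 of a power of two):

```python
import math

def shao_split(n: int) -> tuple:
    """Return (n_c, n_v) with n_c = floor(n**(3/4)) and n_v = n - n_c."""
    n_c = int(n ** 0.75)
    return n_c, n - n_c

def distance_to_power_of_two(k: int) -> int:
    """Absolute distance from k to the nearest power of two."""
    return abs(k - 2 ** round(math.log2(k)))

# Report splits where both parts are close to powers of two
# (e.g. n = 640 gives n_c = 127 ~ 2^7 and n_v = 513 ~ 2^9).
for n in range(100, 2001):
    n_c, n_v = shao_split(n)
    if distance_to_power_of_two(n_c) <= 1 and distance_to_power_of_two(n_v) <= 1:
        print(f"n={n}: n_c={n_c}, n_v={n_v}")
```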

tdMJN6B2JtUe