
I have a training dataset describing what TV viewers watch and for how long. The goal, on a new set (the test set), is to predict how long viewers will watch a specific channel and show.

Specifically, I have the following predictors:

  • TV Channel (e.g. BBC, CNN etc)
  • Content (e.g. news, entertainment, business etc)
  • Starting time (e.g. 11:00, 14:00 etc)

and the following target:

  • End time (e.g. 12:00, 15:00 etc)

Obviously, I am going to apply one-hot encoding to the TV Channel and Content predictors and handle Starting time in some suitable way (see more here: Encoding features like month and hour as categorial or numeric?).
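To make the encoding concrete, here is a minimal sketch of what I have in mind. The category lists are made up, and the sine/cosine treatment of the start time is only one possible choice (an assumption on my part, not something I have settled on):

```python
import math

# Hypothetical category lists; the real ones would come from the training set.
CHANNELS = ["BBC", "CNN"]
CONTENTS = ["news", "entertainment", "business"]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def cyclical_time(hhmm):
    """Map 'HH:MM' to a (sin, cos) pair so that 23:59 ends up close to 00:00."""
    hours, minutes = map(int, hhmm.split(":"))
    angle = 2 * math.pi * (hours * 60 + minutes) / (24 * 60)
    return [math.sin(angle), math.cos(angle)]

def encode(channel, content, start):
    """Concatenate the one-hot and cyclical parts into one feature vector."""
    return one_hot(channel, CHANNELS) + one_hot(content, CONTENTS) + cyclical_time(start)

features = encode("BBC", "news", "20:00")
```

Every viewer who starts 'BBC', 'news' at '20:00' gets exactly this same feature vector, which is what leads to the question below.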

However, in my training set I may have multiple observations with the same predictor values (e.g. 'BBC', 'news', '20:00') but with different outputs. This obviously happens because different users start watching the same thing at the same time but stop at different times.

Is this going to be a problem, given that my test set also includes observations like these?

Specifically, I do not want to receive the same output (end-time) for these observations but I want to receive different outputs which (ideally) follow the distribution of the respective observations in the training set. How can I achieve this?

Shall I simply add a new categorical variable for each user?

Outcast

1 Answer


I hope I did not misunderstand your question, as otherwise, this may sound too trivial to you.

I don't think it's a problem per se that the same X (predictor) values lead to different Y (end times). After all, if every combination of X led to a unique Y, what would be the point of estimating a model? Think of a simple algorithm that predicts the majority outcome: e.g. for 10 observations with the same X, you may get

  • 5 * Y1
  • 3 * Y2
  • 2 * Y3

so that your prediction is Y1 for this X.
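A minimal sketch of such a majority predictor, just to illustrate the idea. The rows are made up to match the 5/3/2 split above, and the concrete end times standing in for Y1, Y2 and Y3 are assumptions:

```python
from collections import Counter, defaultdict

# Made-up training rows: (channel, content, start time) -> observed end time
training = (
    [(("BBC", "news", "20:00"), "21:00")] * 5    # 5 * Y1
    + [(("BBC", "news", "20:00"), "20:30")] * 3  # 3 * Y2
    + [(("BBC", "news", "20:00"), "21:30")] * 2  # 2 * Y3
)

# Count how often each end time occurs for each predictor combination.
counts = defaultdict(Counter)
for x, y in training:
    counts[x][y] += 1

def predict_majority(x):
    """Return the most frequent end time seen for this predictor combination."""
    return counts[x].most_common(1)[0][0]

prediction = predict_majority(("BBC", "news", "20:00"))
```

The duplicated X values are not an obstacle here; they are exactly what the counts are built from.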

That being said, I don't think your target variable needs to be categorical; I'd rather treat it as continuous, and the same goes for start time. Then, in the simple case, run a regression.
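As a rough sketch of the simple case, assuming times are converted to minutes after midnight and using made-up observations for a single channel/content combination, a closed-form least-squares fit could look like this:

```python
from statistics import mean

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes after midnight (a continuous value)."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes

# Made-up (start, end) pairs; note the three different end times for 20:00.
pairs = [
    ("11:00", "12:00"),
    ("14:00", "15:00"),
    ("20:00", "21:00"),
    ("20:00", "20:40"),
    ("20:00", "21:20"),
]
x = [to_minutes(start) for start, _ in pairs]
y = [to_minutes(end) for _, end in pairs]

# Closed-form simple least squares: end = intercept + slope * start
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

def predict_end(start):
    """Predicted end time in minutes after midnight for a given start time."""
    return intercept + slope * to_minutes(start)

predicted = predict_end("20:00")
```

The repeated 20:00 rows simply pull the fitted line towards their average end time. In practice you would add the encoded channel and content as further regressors.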

Further, I would not recommend grouping by the same X and adding a categorical user type. This leads to a deterministic relationship (mapping), i.e.

  • User type 1 --> Y1
  • User type 2 --> Y2

because the way you derive the user type is by finding the combination of X that leads to a unique Y.

I'd rather treat this as a separate analysis, i.e. inferring user types from TV watching, where you basically cluster your data and call each cluster a user type.

Edit

Based on your edited problem statement, recall that $Y$ is modelled conditionally on $X$, i.e.

$Y_i = E[Y_i|X_i] + \epsilon_i$.

So your estimated end time $\hat{Y}$ for a certain combination of $X$, e.g. $X_i = \{BBC, news, 20:00\}$, is the conditional expected value, i.e. a mean estimate, but your predicted $\hat{Y_i}$ has a distribution (and if you want to be more conservative, you can give a confidence interval for your prediction, which graphically speaking is a slice of that distribution). This distribution is different for every $X_i$; see e.g. the figure in this doc.

Now, to your request: what you could do is sample from the distribution of $Y_i$ (the respective end time). Say the same $X_i = \{BBC, news, 20:00\}$ appears 10 times; then draw 10 samples from the distribution of $Y_i$ and output those.
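One way to sketch this, assuming you simply resample the end times observed in the training set for that $X_i$ (an empirical distribution, which is my assumption here; you could equally sample from a fitted parametric one), with made-up data:

```python
import random
from collections import defaultdict

# Made-up training rows: (channel, content, start time) -> observed end time
training = (
    [(("BBC", "news", "20:00"), "21:00")] * 5
    + [(("BBC", "news", "20:00"), "20:30")] * 3
    + [(("BBC", "news", "20:00"), "21:30")] * 2
)

# Collect all observed end times per predictor combination.
observed = defaultdict(list)
for x, end in training:
    observed[x].append(end)

def sample_end_times(x, n, seed=None):
    """Draw n end times with replacement from those observed for x."""
    rng = random.Random(seed)
    return rng.choices(observed[x], k=n)

samples = sample_end_times(("BBC", "news", "20:00"), 10, seed=0)
```

Over many draws, the sampled end times follow the training frequencies for that combination, which is what you asked for.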

Yet, I don't think this is the best solution. Graphically speaking again, I would provide the error bounds, as in the last graph here. In words: don't output 10 different predictions for the same input, but rather state that, for $X_i = \{BBC, news, 20:00\}$, the predicted end time $Y_i$ lies between 20:55 and 21:10 with 90% confidence.
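A rough sketch of how such bounds could be obtained, assuming you take the empirical 5% and 95% quantiles of the end times observed for that $X_i$ (one simple option among several; the numbers below are made up and will not reproduce the 20:55 to 21:10 example):

```python
from statistics import quantiles

def to_hhmm(minutes):
    """Format minutes after midnight as 'HH:MM' (fractions are truncated)."""
    whole = int(minutes)
    return f"{whole // 60:02d}:{whole % 60:02d}"

# Made-up end times (minutes after midnight) for viewers of BBC news at 20:00.
end_minutes = [1240, 1250, 1255, 1255, 1260, 1260, 1262, 1265, 1270, 1290]

# n=20 yields cut points at 5%, 10%, ..., 95%; the first and last of them
# bound the central 90% of the observed end times.
cuts = quantiles(end_minutes, n=20, method="inclusive")
lower, upper = cuts[0], cuts[-1]
interval = (to_hhmm(lower), to_hhmm(upper))
```

With a fitted regression you would instead derive the bounds from the model's prediction error, but the output has the same shape: one interval per input rather than 10 different point predictions.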

Nic