I am doing some data science problems for practice, and this is the question I'm currently tackling:

Given a list of L values generated independently by some unknown process, we will use the mean of L to predict unseen values generated by the same process. Use leave-one-out cross-validation to estimate the mean absolute error (MAE) of this process.

  • Input: An array of floats arr
  • Output: A float score

Example:

  • arr = [1, 2, 3]
  • score = 1.0

Now, usually, the input variables (X) and the target variable (y) have the same number of rows. But in this case, since it says "we will use the mean of L to predict unseen values", what does y look like? In the given example, X is a single row, so taking its mean gives a scalar value, and cross-validation then raises an error:

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# input list of values
x = [[2, 5, 4, 3, 4, 6, 7, 5, 8, 9]]

# define the output as the mean of the inputs, as specified in the question
y = [np.mean(x)]

# build multiple linear regression model
model = LinearRegression()

# define cross-validation method to use
cv = LeaveOneOut()

# use LOOCV to evaluate model
scores = cross_val_score(model, x, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

# view mean absolute error
np.mean(np.absolute(scores))

>>>

ValueError                                Traceback (most recent call last)
Input In [70], in <cell line: 18>()
     17 # use LOOCV to evaluate model
---> 18 scores = cross_val_score(model, x, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

[... joblib and sklearn internal frames elided ...]

File ~/miniforge3/lib/python3.10/site-packages/sklearn/model_selection/_split.py:163, in LeaveOneOut._iter_test_indices(self, X, y, groups)
    161 n_samples = _num_samples(X)
    162 if n_samples <= 1:
--> 163     raise ValueError(
    164         "Cannot perform LeaveOneOut with n_samples={}.".format(n_samples)
    165     )

ValueError: Cannot perform LeaveOneOut with n_samples=1.


Curiously, if I duplicate the contents of X and y, the error goes away and a score of 0.0 is printed:

# input list of values
x = [[2, 5, 4, 3, 4, 6, 7, 5, 8, 9], [2, 5, 4, 3, 4, 6, 7, 5, 8, 9]]

# define the output as the mean of the inputs, as specified in the question
y = [np.mean(x), np.mean(x)]
...

>>> 0.0

Why is that?

Answer:

You have not interpreted the problem correctly.

I will try to explain using your example, with the array [1, 2, 3].

The cross-validation is called "leave one out" because each fold holds out a single sample for testing and trains on the rest; with 3 samples, there are 3 folds.

In the first fold, elements [1, 2] are used for training and [3] for testing. The mean of the training elements is 1.5, so the prediction is 1.5 and the absolute error is |3 - 1.5| = 1.5.

We repeat with 2 and then 1 as the test element, training on the other two:

  • Mean of 1 and 3: 2, absolute error = |2 - 2| = 0
  • Mean of 2 and 3: 2.5, absolute error = |1 - 2.5| = 1.5

So, the mean absolute error will be mean([1.5, 0, 1.5]) = 1.0.
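
To double-check the arithmetic, here is a throwaway sketch that replays the three folds (the loop and variable names are mine):

import numpy as np

arr = [1, 2, 3]
for i in range(len(arr)):
    train = arr[:i] + arr[i+1:]    # leave element i out
    prediction = np.mean(train)    # the model predicts the training mean
    print(f"test={arr[i]}, prediction={prediction}, error={abs(arr[i] - prediction)}")
# errors: 1.5, 0.0, 1.5 -> mean absolute error = 1.0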

You tried to treat this as a usual machine-learning problem with tabular data, but your X is not a single row (the problem statement says the input is an array, yet you define it as a 2D array with one row in the code). It is a column of samples that serves as both the features and the target, and the model you have to use is simply y_pred = np.mean(x). This also answers why duplicating the data gave 0.0: each fold then trains on a row identical to the test row, LinearRegression fits that single training sample exactly, and so the prediction matches the target with zero error.
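
If you do want to stay within scikit-learn, DummyRegressor (whose "mean" strategy predicts the mean of the training targets) expresses this model directly. A minimal sketch, assuming the input list is called arr:

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

arr = [2, 5, 4, 3, 4, 6, 7, 5, 8, 9]

# one sample per row; DummyRegressor ignores the feature values and
# always predicts the mean of the training targets
X = np.asarray(arr).reshape(-1, 1)
y = np.asarray(arr)

scores = cross_val_score(DummyRegressor(strategy="mean"), X, y,
                         scoring='neg_mean_absolute_error', cv=LeaveOneOut())
print(-scores.mean())  # 1.9555555555555557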

The following snippet does not use library functions (well, apart from np.mean) and is easy to understand:

import numpy as np

# the "model": predict the mean of the training data
def model(X):
    return np.mean(X)

def cross_validation(X, model):
    errors = []
    for i in range(len(X)):
        # leave element i out
        test_element = X[i]
        train_elements = X[0:i] + X[i+1:len(X)]

        prediction = model(train_elements)
        error = abs(prediction - test_element)
        errors.append(error)

    return np.mean(errors)


arr1 = [1, 2, 3]
arr2 = [2, 5, 4, 3, 4, 6, 7, 5, 8, 9]
print(cross_validation(arr1, model))
print(cross_validation(arr2, model))

and produces

1.0
1.9555555555555557
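
As an aside, for this particular model the leave-one-out MAE has a closed form: leaving out x_i changes the training mean from m to (n*m - x_i)/(n - 1), so the fold error is n*|x_i - m|/(n - 1) and the score is n/(n - 1) times the mean absolute deviation of the data. A quick check against the function above (the helper name is mine):

import numpy as np

def loo_mae_closed_form(arr):
    a = np.asarray(arr, dtype=float)
    n = len(a)
    # n/(n-1) times the mean absolute deviation from the overall mean
    return n / (n - 1) * np.mean(np.abs(a - a.mean()))

print(loo_mae_closed_form([1, 2, 3]))                       # 1.0
print(loo_mae_closed_form([2, 5, 4, 3, 4, 6, 7, 5, 8, 9]))  # 1.9555555555555557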