Generate predictions that are orthogonal (uncorrelated) to a given variable

Question

I have an X matrix, a y variable, and another variable ORTHO_VAR. I need to predict the y variable using X, however, the predictions from that model need to be orthogonal to ORTHO_VAR while being as correlated with y as possible.

I would prefer that the predictions are generated with a non-parametric method such as xgboost.XGBRegressor but I could use a linear method if absolutely necessary.

This code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
TARGET = 'target'
PRED = 'yhat'

# Create regression dataset with two correlated targets
X, y = make_regression(n_features=20, random_state=245, n_targets=2)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df[PRED] = xgb.predict(df[indep_vars])

# Correlation should be low or preferably zero
pred_corr_w_ortho = df.corr().abs()[PRED][ORTHO_VAR]
assert pred_corr_w_ortho < 0.01, "Correlation score: {0} is superior to the given threshold.".format(pred_corr_w_ortho)

Returns this:

---------------------------------------------------------------------------
AssertionError                           
      1 pred_corr_w_ortho = df.corr().abs()[PRED][ORTHO_VAR]
----> 2 assert pred_corr_w_ortho < 0.01, "Correlation score: {0} is superior to the given threshold.".format(pred_corr_w_ortho)

AssertionError: Correlation score: 0.5895885756753665 is superior to the given threshold.

...and I would like something that maintains as much predictive accuracy as possible while remaining orthogonal to ORTHO_VAR

score 6 · Accepted Answer · answered Apr 15 '19 at 13:09

This requirement can be satisfied by adding sufficient noise to predictions $\hat{y}$ to decorrelate them from orthogonal values $v$. Ideally, if $\hat{y}$ is already decorrelated from $v$, no noise would be added to $\hat{y}$, thus $\hat{y}$ would be maximally correlated with $y$.

Mathematically, we want to create $\hat{y}'=\hat{y}+\epsilon$ from $\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon})$, to satisfy $$r_{\hat{y}'v} = \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_{v}} < \delta$$ for arbitrary threshold $\delta$. Now, lets expand this inequality to find a lower-bound for std of noise $\epsilon$, i.e. $\sigma_{\epsilon}$. $$\begin{align*} \sigma_{\hat{y}'}^2&=\sigma_{\hat{y}}^2 + \sigma_{\epsilon}^2,\\ \sigma_{\hat{y}'v}&={\Bbb E}\left[(\hat{y}+\epsilon - \mu_{\hat{y}} - \overbrace{\mu_{\epsilon}}^{=0})(v-\mu_{v})\right]\\ &={\Bbb E}\left[(\hat{y} - \mu_{\hat{y}})(v-\mu_{v})\right]+\overbrace{{\Bbb E}\left[\epsilon(v-\mu_{v})\right]}^{=0}&\\ &=\sigma_{\hat{y}v},\\ r_{\hat{y}'v} &= \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_{v}} =\frac{\sigma_{\hat{y}v}}{\sigma_{v} \sqrt{\sigma_{\hat{y}}^2+\sigma_{\epsilon}^2}} < \delta\\ &\Rightarrow \sigma_{\hat{y}}\sqrt{\left(\frac{r_{\hat{y}v}}{\delta}\right)^2 - 1} < \sigma_{\epsilon} \end{align*}$$

Since all the variables in the left side of inequality can be calculated, we can sample noises from $\mathcal{N}(0, \sigma_{\epsilon})$ and add them to $\hat{y}$ to satisfy the original inequality.

Here is a code that does the exact same thing:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

seed = 245
# Create regression dataset with two correlated targets
X, y = make_regression(n_samples=10000, n_features=20, random_state=seed, n_targets=2)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

delta = 0.01

# std of noise required to be added to y_hat to bring the correlation
# of y_hat with ORTHO_VAR below delta
std_y_hat = np.std(df['yhat'], ddof=1)
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
std_noise_lower_bound = std_y_hat * np.sqrt((corr_y_hat_ortho_var / delta)**2 - 1.0)
std_noise = max(0, std_noise_lower_bound) + 1
print('delta: ', delta)
print('std_y_hat: ', std_y_hat)
print('corr_y_hat_target: ', corr_y_hat_target)
print('corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
print('std_noise_lower_bound: ', std_noise_lower_bound)
print('std_noise: ', std_noise)

# add noise
np.random.seed(seed)
noises = np.random.normal(0, std_noise, len(df['yhat']))
noises -= np.mean(noises)  # remove slight deviations from zero mean
print('noise_samples: mean:', np.mean(noises), ', std: ', np.std(noises))
df['yhat'] = df['yhat'] + noises

# measure new correlation
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
print('new corr_y_hat_target: ', corr_y_hat_target)
print('new corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
# Correlation should be low or preferably zero
assert corr_y_hat_ortho_var < delta, corr_y_hat_ortho_var
assert -delta < corr_y_hat_ortho_var, corr_y_hat_ortho_var

which outputs:

delta:  0.01
std_y_hat:  69.48568725585938
corr_y_hat_target:  0.8207672834857673
corr_y_hat_ortho_var:  0.7663936356880843
std_noise_lower_bound:  5324.885500165032
std_noise:  5325.885500165032
noise_samples: mean: 1.1059455573558807e-13 , std:  5373.914830034988
new corr_y_hat_target:  -0.004125016071865934
new corr_y_hat_ortho_var:  -0.000541131379457552

You can experiment with other deltas. By comparing std_y_hat with std_noise_lower_bound, you can see that a huge noise must be added to $\hat{y}$ to decorrelate it from $v$ bellow $0.01$, which dramatically decolerates $\hat{y}$ from $y$ too.

Note: Assertion might fail for too small thresholds $\delta$ due to insufficient sample count.

Generate predictions that are orthogonal (uncorrelated) to a given variable

1 Answers1