
I have two datasets (DS_A and DS_B) with the same 20 features but different feature distributions. How can I sample DS_A so that its distribution becomes similar to that of DS_B with respect to multiple features?

I check how similar the two datasets are by comparing each individual feature of DS_A against the same feature in DS_B, both in shape and in percentiles. The features are mostly numerical; some are binary and some are normalized.
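A minimal sketch of such a per-feature percentile check, assuming both datasets are NumPy arrays with one column per feature. The helper name `percentile_gap` and the percentile grid are hypothetical, chosen only for illustration:

```python
import numpy as np

def percentile_gap(ds_a, ds_b, qs=(5, 25, 50, 75, 95)):
    """Absolute per-feature difference between the percentiles of two datasets.

    ds_a, ds_b: arrays of shape (n_samples, n_features); sample counts may differ.
    Returns an array of shape (len(qs), n_features).
    """
    pa = np.percentile(ds_a, qs, axis=0)
    pb = np.percentile(ds_b, qs, axis=0)
    return np.abs(pa - pb)

# Synthetic stand-ins for DS_A and DS_B (assumed data, not the real datasets)
rng = np.random.default_rng(0)
ds_a = rng.normal(0.0, 1.0, size=(1000, 3))
ds_b = rng.normal(10.0, 3.0, size=(800, 3))

gap = percentile_gap(ds_a, ds_b)
print(gap)  # one row per percentile, one column per feature
```

Large entries point at the features and percentiles where the two datasets disagree most.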


Background:

Some time ago I trained a model using dataset DS_B as ground truth. Now I want to retrain the model with more recent data and see whether the performance improves. The new ground truth data I collected is DS_A, but for practical reasons it is collected somewhat differently, so the feature distribution in the new dataset differs from that of the old one.

cybergeek654

1 Answer


One simple way is to transform your distribution linearly. That should work fine if the distribution of your data has changed approximately linearly.
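One possible way to judge whether a feature has changed "approximately linearly" is to compare the quantiles of the two samples: if one distribution is a shifted and scaled copy of the other, its quantiles are a linear function of the other's. The sketch below is only an illustration under that assumption; the function name `quantile_linearity` and the synthetic data are hypothetical.

```python
import numpy as np

def quantile_linearity(a, b, n_quantiles=99):
    """Fit quantiles(b) ~ slope * quantiles(a) + intercept and return the fit with its R^2."""
    qs = np.linspace(1, 99, n_quantiles)
    qa = np.percentile(a, qs)
    qb = np.percentile(b, qs)
    slope, intercept = np.polyfit(qa, qb, 1)  # least-squares straight line
    residuals = qb - (slope * qa + intercept)
    r2 = 1.0 - residuals.var() / qb.var()
    return slope, intercept, r2

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)
b_linear = 3.0 * rng.normal(0.0, 1.0, 5000) + 10.0  # shifted and scaled copy of a's distribution
b_skewed = rng.exponential(1.0, 5000)               # different shape, not a linear change

slope_lin, intercept_lin, r2_lin = quantile_linearity(a, b_linear)
_, _, r2_skew = quantile_linearity(a, b_skewed)
print(r2_lin, r2_skew)
```

An R^2 close to 1 suggests a linear transformation is a reasonable fit for that feature; a visibly lower value suggests the shape itself has changed.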

The question is how to change the distribution of DS_A to match the distribution of DS_B with respect to multiple features.

That being said, you could transform each feature of DS_A by:

  1. Subtracting mean(DS_A) and dividing by the standard deviation of DS_A
  2. Multiplying by the standard deviation of DS_B and adding mean(DS_B)

Long story short: you change the mean and the standard deviation of the DS_A distribution to match those of DS_B.
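For several features at once, the same steps can be applied column by column. The following is a minimal sketch, assuming DS_A and DS_B are NumPy arrays with the same columns in the same order; the function name `match_mean_std` and the synthetic data are hypothetical.

```python
import numpy as np

def match_mean_std(ds_a, ds_b):
    """Linearly rescale each column of ds_a to the mean and std of the same column in ds_b."""
    mean_a, std_a = ds_a.mean(axis=0), ds_a.std(axis=0)
    mean_b, std_b = ds_b.mean(axis=0), ds_b.std(axis=0)
    # Assumes no column of ds_a is constant (std_a > 0); otherwise this divides by zero.
    return (ds_a - mean_a) / std_a * std_b + mean_b

# Synthetic two-feature stand-ins for DS_A and DS_B (assumed data)
rng = np.random.default_rng(0)
ds_a = rng.normal([0.0, 5.0], [1.0, 2.0], size=(1000, 2))
ds_b = rng.normal([10.0, -3.0], [3.0, 0.5], size=(800, 2))

ds_a_new = match_mean_std(ds_a, ds_b)
print(ds_a_new.mean(axis=0), ds_b.mean(axis=0))
print(ds_a_new.std(axis=0), ds_b.std(axis=0))
```

Each feature is treated independently here, so correlations between features are left as they are in DS_A, and a binary column would no longer contain only 0 and 1 after rescaling.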

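Here is Python code that applies this transformation to two Gaussian distributions (Distribution 2 plays the role of DS_A and is transformed to match Distribution 1, which plays the role of DS_B):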

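import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# True means and standard deviations of the two Gaussians
m1, m2 = 0, 10
s1, s2 = 1, 3

x1 = np.random.normal(m1, s1, 1000)  # target distribution
x2 = np.random.normal(m2, s2, 1000)  # distribution to transform

# histplot with kde=True replaces the deprecated distplot
sns.histplot(x1, kde=True, stat='density', alpha=0.1, color='red', label='Distribution 1')
sns.histplot(x2, kde=True, stat='density', alpha=0.1, color='green', label='Distribution 2')

# Estimate the parameters from the samples, as you would with real data
estimated_x1_mean = np.mean(x1)
estimated_x1_sd   = np.std(x1)
estimated_x2_mean = np.mean(x2)
estimated_x2_sd   = np.std(x2)

# Standardize x2, then rescale to the standard deviation of x1 and shift to its mean
x2_new = (x2 - estimated_x2_mean) / estimated_x2_sd * estimated_x1_sd + estimated_x1_mean
sns.histplot(x2_new, kde=True, stat='density', alpha=0.1, color='blue', edgecolor='black', label='Distribution 2 after Transformation')
plt.legend()
plt.show()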
And the result: [plot of Distribution 1, Distribution 2, and Distribution 2 after the transformation]

Giannis Krilis