How to fix left and right skewness

Question

I know that a skewed distribution has a long tail on one side: on the left for left skewness, on the right for right skewness.

The example below shows right skewness.

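import pandas as pd
import seaborn as sns

data = pd.DataFrame({
  "Income": [15000, 22000, 30000, 35000, 42000, 50000, 65000, 78000, 120000, 250000]
})
# histplot with kde=True replaces the deprecated distplot
sns.histplot(data["Income"], kde=True, color="skyblue", label="Income")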

My understanding of right skewness is that there are fewer data points toward the higher values on the x-axis. Is that correct? And how do I fix this chart so that it looks like a normal distribution?

AFAIK, using the central limit theorem I should be able to convert any shape to a standard normal distribution, but it requires a minimum sample size of 30. So in my case, is there another way, or is adding more data points the only option?

score 2 · Accepted Answer · answered Oct 21 '24 at 06:49

You can transform your data toward a more normal distribution with the Yeo-Johnson transformation, available in scikit-learn as PowerTransformer, which by default also standardizes the output to zero mean and unit variance. Here's the code to implement it:

import pandas as pd
from sklearn.preprocessing import PowerTransformer
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.DataFrame({
  "Income": [15000, 22000, 30000, 35000, 42000, 50000, 65000, 78000, 120000, 250000]
})

# Apply Yeo-Johnson Power Transformation to the "Income" data
data["Income_Normalized"] = PowerTransformer(method='yeo-johnson').fit_transform(data[["Income"]])

# Plotting the distributions
plt.figure(figsize=(12, 6))
for i, col in enumerate(["Income", "Income_Normalized"], 1):
    plt.subplot(1, 2, i)
    sns.histplot(data[col], bins=10, kde=True)
    plt.title(f'{"Normalized" if col == "Income_Normalized" else "Original"} Income Distribution')
    plt.xlabel('Income' if col == "Income" else 'Normalized Income')
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

Here are the graphs showing the original and normalized income distributions:
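
To check the effect numerically rather than by eye, one option is to compare the sample skewness before and after the transformation. The snippet below is only a minimal sketch: it uses pandas' `Series.skew()` as the measure, and it adds a plain log transform purely for comparison, which is an assumption of this example and not part of the approach above. Values near 0 indicate a roughly symmetric distribution; positive values indicate right skew.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

income = pd.Series(
    [15000, 22000, 30000, 35000, 42000, 50000, 65000, 78000, 120000, 250000],
    name="Income",
)

# Yeo-Johnson, as in the code above (output is standardized by default).
yeo_johnson = pd.Series(
    PowerTransformer(method="yeo-johnson").fit_transform(income.to_frame()).ravel()
)

# Plain natural-log transform, shown only for comparison (an assumption of
# this sketch; it works here because every income is strictly positive).
log_income = np.log(income)

# Series.skew() returns the bias-corrected sample skewness.
skewness = {
    "original": income.skew(),
    "yeo-johnson": yeo_johnson.skew(),
    "log": log_income.skew(),
}

for name, value in skewness.items():
    print(f"{name:>12}: {value:+.2f}")
```

With only 10 observations these estimates are noisy, so treat them as a rough sanity check rather than a formal test of normality.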
