
I am working on an insurance-domain use case to predict whether an existing customer will buy a second insurance policy. I have some personal details saved as categorical variables, such as Marital Status, Smoker (Yes/No), Age (Young, Adult, Senior Citizen) and Gender (Male/Female), and a few continuous variables such as Premium Paid and Sum Insured.

My goal is to use this mixed set of categorical and continuous variables to predict the class (1 - will buy a second policy, 0 - will not buy a second policy). How can I find/compute the correlation in this dataset and pick only the significant variables to use in a logistic regression model for classification?

I would appreciate it if someone could provide articles or a link to a similar piece of work done in Python.

tanmay

1 Answer


Regarding your question about Python implementations of the given R examples: scikit-learn has ready-to-use implementations of the feature-selection methods described in the linked R question (see here).

Here is an example for categorical input and output data: with SelectKBest you can select the k features with the highest correlation, e.g. based on a chi-squared test.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load the famous iris data set for which X.shape is (150, 4) and y.shape (150,)
iris = load_iris()
X, y = iris.data, iris.target

# Add exponentially distributed noise (20 new attributes)
rng = np.random.RandomState()
noise = rng.exponential(size=(len(iris.data), 20))
# While X.shape is (150, 4), X_noisy has shape (150, 24)
X_noisy = np.hstack([iris.data, noise])

# Select 4 features based on chi squared test
selector = SelectKBest(chi2, k=4)
selector.fit(X_noisy, y)
X_selected = selector.transform(X_noisy)

Checking the shapes gives the following:

X.shape           # (150, 4)
X_noisy.shape     # (150, 24)
X_selected.shape  # (150, 4)

You can also check which features were selected:

print(selector.get_support())
[False False  True  True False False False False False  True False False
 False False False False False False False False False False  True False]

As you can see, 2 of the 4 original (non-noise) features were selected. The feature selectors also have attributes to check, for example, the p-values (see here for SelectKBest).
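
For instance, continuing from the fitted selector above, you can print the chi-squared statistic and p-value per feature via the scores_ and pvalues_ attributes:

# Chi-squared statistic and p-value for each feature of the fitted selector
for i, (score, p) in enumerate(zip(selector.scores_, selector.pvalues_)):
    print(f"feature {i:2d}: chi2 = {score:8.2f}, p = {p:.4f}")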

The book 'Introduction to Machine Learning with Python' by Mueller and Guido has a section about this too (their example is very close to the one above).

However, the chi-squared test in the example above is applicable to categorical independent and dependent variables. For a mix of categorical and continuous independent variables you may need to discretize the continuous variables or use other feature-selection methods, e.g. model-based selection.
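
To illustrate the model-based route, here is a minimal sketch (the insurance-like column names, data and toy target are made up for illustration, not from your dataset): one-hot encode the categorical variables, scale the continuous ones, and let an L1-penalised logistic regression decide which encoded features to keep via SelectFromModel.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "smoker": rng.choice(["yes", "no"], size=n),                   # categorical
    "age_group": rng.choice(["young", "adult", "senior"], size=n), # categorical
    "premium_paid": rng.exponential(1000, size=n),                 # continuous
    "sum_insured": rng.exponential(50000, size=n),                 # continuous
})
# Toy target: 1 = buys a second policy, 0 = does not
y = ((df["premium_paid"] > 800) & (df["smoker"] == "yes")).astype(int)

# One-hot encode the categorical columns, scale the continuous ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker", "age_group"]),
    ("num", StandardScaler(), ["premium_paid", "sum_insured"]),
])
X = preprocess.fit_transform(df)

# L1-penalised logistic regression shrinks uninformative coefficients to zero,
# so SelectFromModel keeps only the features with non-zero weights
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0))
selector.fit(X, y)

print(preprocess.get_feature_names_out())  # names of the encoded columns
print(selector.get_support())              # which of them were kept

The selected columns could then be fed into your final logistic regression classifier.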

Jonathan