I would greatly appreciate it if you could let me know how to draw a heatmap-like plot for categorical features.

Based on this post, the association between categorical variables should be computed using Cramér's V. I found the following code to plot it, but I don't understand why its author also included "contrib" (the contribution), which is a numeric variable.

import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns


def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))


# df is assumed to be an existing DataFrame that contains these columns
cols = ["Party", "Vote", "contrib"]
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
# only off-diagonal pairs are filled in; the diagonal stays at 0
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]

corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(7, 6))
ax = sns.heatmap(corr, annot=True, ax=ax)
ax.set_title("Cramer V Correlation between Variables")

I also found Bokeh. However, I am not sure whether it uses Cramér's V to plot the heatmap.

In my case, I have two categorical features: the first one has 2 categories and the second one has 37 categories.

I need the plot to look like the last two plots presented here, but with the association values displayed on it as well.
Thanks in advance.

ebrahimi

2 Answers

Plotting the relationship between categorical features might not be useful. Such a visualization implies an ordering of the categorical values, which might lead to incorrect interpretations.

A more useful option might be a contingency table. One feature would be in the rows, another feature would be in the columns. The cells would be the counts of co-occurrence.
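
As a minimal sketch of such a table, assuming a pandas DataFrame with two categorical columns: the column names "Party" and "Vote" are borrowed from the question, and the values below are made up purely for illustration. `pd.crosstab` builds the counts, and `normalize="index"` gives row-wise proportions if shares are easier to read than raw counts.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
df = pd.DataFrame({
    "Party": ["A", "A", "B", "B", "B", "A", "B", "A"],
    "Vote":  ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"],
})

# One feature in the rows, the other in the columns;
# each cell holds the number of co-occurrences.
counts = pd.crosstab(df["Party"], df["Vote"])

# Optional: row-wise proportions, so that each row sums to 1.
row_share = pd.crosstab(df["Party"], df["Vote"], normalize="index")

print(counts)
print(row_share)
```

With a 2-category feature against a 37-category feature, putting the 37 categories in the rows would probably keep the table readable.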

Brian Spiering

If your goal is to have a color representation of the contingency table then you can use pd.crosstab along with background_gradient like this:

import pandas as pd

data = {
    'City': ['City1', 'City2', 'City1', 'City2', 'City1', 'City2', 'City3', 'City2', 'City1', 'City2',
             'City1', 'City2', 'City3', 'City3', 'City1', 'City2', 'City1', 'City2', 'City1', 'City2'],
    'Sales': [100, 200, 200, 200, 100, 100, 400, 400, 500, 500,
              100, 100, 200, 300, 400, 200, 400, 300, 100, 100],
}
df = pd.DataFrame(data)

df_cross = pd.crosstab(df["Sales"], df["City"])
df_cross.style.background_gradient(vmin=df_cross.values.min(), vmax=df_cross.values.max())
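
The styled table is rendered as HTML, for example in a Jupyter notebook. If a static figure with the numbers printed in the cells is needed, one possible sketch is to draw the same kind of crosstab with `sns.heatmap(..., annot=True)` and put a single association value in the title. Everything specific here is an assumption for illustration: the shortened toy data, the output file name `crosstab_heatmap.png`, and the use of the plain (uncorrected) Cramér's V, computed as sqrt(chi2 / (n * (min(r, k) - 1))).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

# Hypothetical toy data, shortened for illustration.
df = pd.DataFrame({
    "City": ["City1", "City2", "City1", "City2", "City3", "City3", "City1", "City2"],
    "Sales": [100, 200, 100, 200, 300, 300, 200, 100],
})
df_cross = pd.crosstab(df["Sales"], df["City"])

# Plain (uncorrected) Cramér's V for the two features.
observed = df_cross.to_numpy()
chi2 = ss.chi2_contingency(observed, correction=False)[0]
n = observed.sum()
r, k = observed.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Heatmap of the counts, with each count written into its cell.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(df_cross, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_title(f"Sales vs. City counts (Cramér's V = {cramers_v:.2f})")
fig.tight_layout()
fig.savefig("crosstab_heatmap.png")
```

With only two features there is a single Cramér's V value, so showing it in the title and the counts in the cells avoids a 2x2 matrix that would carry just one number.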

dmayilyan