
I’m still learning data science and trying to improve my understanding of statistical tests. Right now, I’m working with a dataset where I have a categorical feature (e.g., “School Type” with values like Public, Private…) and a numeric target (e.g., student scores). However, the numeric target is not normally distributed.

  1. What are the best statistical tests to measure the correlation between a categorical variable and a non-normally distributed numeric target? I’ve seen tests like ANOVA (which assumes normality) and Kruskal-Wallis (which is non-parametric), but I’m not sure which is the best choice in different scenarios. Are there other tests I should consider?

  2. Once I calculate the correlation, how can I determine how each category affects the target? For example, how do I find out which categories have a positive or negative effect on student scores? Should I compare medians, use effect sizes, or apply another method?

I’d really appreciate any insights or recommendations.

Gab

1 Answer


ANOVA is a good starting point for testing whether the target differs across the categories. The small example below illustrates the idea.

\begin{array} {|l|r|r|r|}\hline & A & B & C \\ \hline \text{Public} & 58 & 56 & 58 \\ \hline \text{Private} & 64 & 60 & 58 \\ \hline \text{HomeSchool} & 59 & 59 & 60 \\ \hline \end{array}

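For the $1^{st}$ query, try one-way ANOVA, provided the group variances are roughly equal, i.e. $\sigma_{public} = \sigma_{private} = \sigma_{home}$. Non-normality is mainly a problem when the dataset is small; with large samples the group means are approximately normal by the Central Limit Theorem, so the F-test stays reasonably reliable. Kruskal-Wallis is a rank-based test: it replaces the scores with their ranks before comparing groups, so it is the usual fallback when the sample is small or strongly skewed, at the cost of testing for a shift in distribution rather than a difference in means.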
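As a rough illustration, the two tests discussed above can be run side by side with SciPy. This is only a sketch: it assumes each row of the table holds three observed scores for that school type (the meaning of columns A, B, C is not stated), and nine data points are far too few for the large-sample argument to apply, so the p-values are for demonstration only.

```python
from scipy import stats

# Assumption: each row of the example table is one group of observed scores.
public = [58, 56, 58]
private = [64, 60, 58]
homeschool = [59, 59, 60]

# One-way ANOVA: compares group means, assumes roughly equal variances.
f_stat, f_p = stats.f_oneway(public, private, homeschool)

# Kruskal-Wallis: rank-based alternative with no normality assumption.
h_stat, h_p = stats.kruskal(public, private, homeschool)

# Levene's test: a quick check of the equal-variance assumption.
lev_stat, lev_p = stats.levene(public, private, homeschool)

print(f"ANOVA:          F = {f_stat:.3f}, p = {f_p:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {h_p:.3f}")
print(f"Levene:         W = {lev_stat:.3f}, p = {lev_p:.3f}")
```

If the two tests disagree on real data, that is usually a sign that outliers or skew are driving the ANOVA result, and the rank-based answer deserves more weight.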

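For the $2^{nd}$ query, compare the mean, median, and mode of each category with the overall values: a category whose center lies above the overall center is associated with higher scores, and one below it with lower scores.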
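A minimal sketch of that comparison, using only the standard library and the same assumption that each table row is one group of observed scores. The dictionary layout and the names `mean_diff` and `median_diff` are illustrative, not part of any library.

```python
from statistics import mean, median

# Assumption: each row of the example table is one group of observed scores.
scores = {
    "Public": [58, 56, 58],
    "Private": [64, 60, 58],
    "HomeSchool": [59, 59, 60],
}

# Pool every score to get the overall reference values.
all_scores = [s for group in scores.values() for s in group]
overall_mean = mean(all_scores)
overall_median = median(all_scores)

# Signed difference between each category's center and the overall center.
summary = {
    name: {
        "mean_diff": mean(values) - overall_mean,
        "median_diff": median(values) - overall_median,
    }
    for name, values in scores.items()
}

for name, diffs in summary.items():
    print(f"{name:<10} mean diff = {diffs['mean_diff']:+.2f}, "
          f"median diff = {diffs['median_diff']:+.2f}")
```

A positive difference means the category sits above the pooled center and a negative one below it. With a skewed target, the median difference is generally the more trustworthy of the two.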

Aviral Verma