I am currently working on a project where the data concerns people, and the dataset contains personal data with sensitive attributes (typically: age, sex, handicap, race).

Now it seems there are mainly three options for modelling:

  • Not including the protected attributes in the features. This is usually considered slightly problematic, because hidden correlations with the remaining features can still encode them.
  • Including the protected attributes in the model but doing nothing afterwards. This is usually considered very bad, as the model will explicitly learn the biases.
  • Including the protected attributes in the model, then correcting the decision taken, based on those protected attributes, to ensure fairness (a minimal sketch of this post-processing idea follows the list).
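To make the third option concrete, here is a minimal sketch of group-specific threshold post-processing. It assumes a binary classification task in a hypothetical `people.csv` with a `target` column and a single protected attribute `sex`; the column names and the demographic-parity-style target rate are illustrative assumptions, not a prescription.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("people.csv")                     # hypothetical dataset
protected = df["sex"]                              # hypothetical protected attribute
y = df["target"]
X = pd.get_dummies(df.drop(columns=["target"]))    # option 3: keep all features, incl. protected

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, protected, test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Post-processing step: pick one threshold per protected group so that each
# group ends up with roughly the same positive-decision rate (demographic parity).
target_rate = (scores >= 0.5).mean()               # overall rate at the default threshold
decisions = np.zeros(len(scores), dtype=int)
for group in g_te.unique():
    mask = (g_te == group).to_numpy()
    thr = np.quantile(scores[mask], 1 - target_rate)
    decisions[mask] = (scores[mask] >= thr).astype(int)

for group in g_te.unique():
    mask = (g_te == group).to_numpy()
    print(group, "positive rate:", decisions[mask].mean())
```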

I am curious whether there is a general rule of thumb for evaluating the impact of model performance on bias. An argument can be made that a better model (think GBDT over a linear model) will be better overall. The counter-argument can also be true, depending on the approach, mainly in the last case, because in the first two cases a better model will also learn the biases, hidden or not, more effectively.

Is there any rule of thumb regarding this issue, and the possible need to implement a better model?


1 Answer


There is no general answer to your question; it depends on your dataset and the problem you are trying to solve with the model. What I can tell you, generally, is that an advanced gradient boosting model will tend to do better than a linear model.

Keep in mind that your question is a popular discussion topic in data science ethics. In your case, I would suggest developing one model that uses the features age, sex, handicap and race, and another model that does not use them. The model that uses them should (it depends heavily on the target variable) have higher accuracy. Bias in statistics often means something different from what we mean by bias in everyday life.
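A rough sketch of that comparison, assuming the same hypothetical `people.csv` with a `target` column and the protected attributes named as in the question:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("people.csv")                        # hypothetical dataset
protected_cols = ["age", "sex", "handicap", "race"]   # protected attributes from the question

y = df["target"]
X_full = pd.get_dummies(df.drop(columns=["target"]))
X_blind = pd.get_dummies(df.drop(columns=["target"] + protected_cols))

idx_tr, idx_te = train_test_split(df.index, test_size=0.3, random_state=0)

# Train the same model class on both feature sets and compare accuracy on the same split.
for name, X in [("with protected attributes", X_full),
                ("without protected attributes", X_blind)]:
    model = GradientBoostingClassifier().fit(X.loc[idx_tr], y.loc[idx_tr])
    acc = accuracy_score(y.loc[idx_te], model.predict(X.loc[idx_te]))
    print(f"{name}: accuracy = {acc:.3f}")
```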

Imagine two models. Model One classifies a user based on many features, including sex and age; we would call that model very biased (in the everyday, non-DS sense). Yet, purely statistically, it can actually be less biased than Model Two, which uses the same features except that sex and age are removed. Why, you may ask?

Well, the second model will give higher feature importance to the remaining features. So if one of the features is, say, eating healthy or unhealthy food, the second model will find it more important than the first model did. Here, by trying to remove bias, you create more bias.
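You can see that shift directly by comparing feature importances of the two models. The sketch below reuses the hypothetical `people.csv` and the assumed protected column names; the "eating healthy" feature is just whatever proxy columns happen to remain in your data.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("people.csv")                         # hypothetical dataset
y = df["target"]
protected_cols = ["age", "sex", "handicap", "race"]

# Fit the same model class with and without the protected attributes and
# print the top features by importance, to see where the weight moves.
for name, drop_cols in [("Model One (all features)", []),
                        ("Model Two (protected attributes removed)", protected_cols)]:
    X = pd.get_dummies(df.drop(columns=["target"] + drop_cols))
    model = GradientBoostingClassifier().fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(name)
    print(importances.sort_values(ascending=False).head(5), "\n")
```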

Back to your question: build both models. You do not have to polish or optimize them to see the differences; early versions will be good enough.

Also keep in mind that, unfortunately, science is not always politically correct. Sometimes genes in specific races make people more prone to certain diseases. Sometimes age is an important factor when predicting credit risk. We might find classifying people based on these factors offensive, but that is how science is.