I have a data set of movies with 28 columns, one of which is genres. For each row, the value in the genres column has the form "Action|Animation|Comedy|Family|Fantasy". I want to encode these values using pandas.get_dummies(), but each cell holds multiple values. How can I handle this? Additional information is at the link below (question moved from Stack Overflow): https://stackoverflow.com/q/40331558/4028904
1 Answer
I'm starting with the following dataset:
import pandas as pd
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Adventure|Fantasy',
                                'Action|Adventure|Thriller',
                                'Action|Thriller']},
                    columns=['title', 'genres'])
     title                           genres
0   Avatar  Action|Adventure|Fantasy|Sci-Fi
1  Pirates         Action|Adventure|Fantasy
2  Spectre        Action|Adventure|Thriller
3   Batman                  Action|Thriller
First, you want to have your data in a structure pairing titles with one genre at a time, multiple rows per title. You can get it in a series like this:
cleaned = data.set_index('title').genres.str.split('|', expand=True).stack()
title
Avatar   0       Action
         1    Adventure
         2      Fantasy
         3       Sci-Fi
Pirates  0       Action
         1    Adventure
         2      Fantasy
Spectre  0       Action
         1    Adventure
         2     Thriller
Batman   0       Action
         1     Thriller
dtype: object
(There's an extra index level that we don't want, but we'll get rid of it soon.) get_dummies will now work, but it encodes each (title, genre) row separately, so we need to re-aggregate by title:
pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()
         g_Action  g_Adventure  g_Fantasy  g_Sci-Fi  g_Thriller
title
Avatar        1.0          1.0        1.0       1.0         0.0
Batman        1.0          0.0        0.0       0.0         1.0
Pirates       1.0          1.0        1.0       0.0         0.0
Spectre       1.0          1.0        0.0       0.0         1.0
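As an alternative sketch (not part of the approach above), pandas also provides Series.str.get_dummies, which accepts a sep argument and should give the same indicator columns in one step. The example below reuses the toy data; the names dummies and encoded are just illustrative, and the g_ prefix is added only to mirror the output above. It also shows one way to attach the indicators back to the remaining columns, assuming the original row index is left untouched.

```python
import pandas as pd

# Same toy data as in the answer above.
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Adventure|Fantasy',
                                'Action|Adventure|Thriller',
                                'Action|Thriller']},
                    columns=['title', 'genres'])

# Split each string on '|' and build one 0/1 indicator column per genre.
dummies = data['genres'].str.get_dummies(sep='|').add_prefix('g_')

# dummies shares data's row index, so a plain join lines the rows up.
encoded = data.drop(columns='genres').join(dummies)
print(encoded)
```

Because this works row by row on the original index, it keeps the rows in their original order and does not merge rows that happen to share a title, which grouping on the title index would do.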
philh