I have a movie data set with 28 columns, one of which is genres. In each row, the value in the genres column has the form "Action|Animation|Comedy|Family|Fantasy". I want to encode it using pandas.get_dummies(), but since each cell holds multiple values, how should I handle this? Additional information is at the link below (the question was moved from Stack Overflow): https://stackoverflow.com/q/40331558/4028904

aks_Nin

1 Answer

I'm starting with the following dataset:

import pandas as pd
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                 'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                            'Action|Adventure|Fantasy',
                            'Action|Adventure|Thriller',
                            'Action|Thriller']},
                columns=['title', 'genres'])


     title                           genres
0   Avatar  Action|Adventure|Fantasy|Sci-Fi
1  Pirates         Action|Adventure|Fantasy
2  Spectre        Action|Adventure|Thriller
3   Batman                  Action|Thriller

First, you want to have your data in a structure pairing titles with one genre at a time, multiple rows per title. You can get it in a series like this:

cleaned = data.set_index('title').genres.str.split('|', expand=True).stack()


title
Avatar   0       Action
         1    Adventure
         2      Fantasy
         3       Sci-Fi
Pirates  0       Action
         1    Adventure
         2      Fantasy
Spectre  0       Action
         1    Adventure
         2     Thriller
Batman   0       Action
         1     Thriller
dtype: object
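
To see what the chained call is doing, it may help to run the two steps separately. This is only a sketch on a two-row subset of the data above; the names wide and long are mine, and the explicit dropna() is a precaution I've added, since whether stack() discards the padding on its own may depend on the pandas version.

```python
import pandas as pd

data = pd.DataFrame({'title': ['Avatar', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Thriller']})

# Splitting with expand=True gives one column per genre position;
# titles with fewer genres are padded with missing values.
wide = data.set_index('title')['genres'].str.split('|', expand=True)

# stack() moves those columns into an inner index level, so each
# (title, position) pair holds a single genre. dropna() removes any
# padding that survives the reshape.
long = wide.stack().dropna()

print(wide)
print(long)
```

The inner index level (0, 1, 2, ...) is just the position of each genre within its original string.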

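(There's an extra index level that we don't want, but we'll get rid of it soon.) get_dummies will now work, but it encodes each row separately, so every title is still spread over several rows with a single genre flagged in each. Grouping by the title index level and summing collapses those into one row per title, which also drops the extra level: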

pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()


         g_Action  g_Adventure  g_Fantasy  g_Sci-Fi  g_Thriller
title
Avatar        1.0          1.0        1.0       1.0         0.0
Batman        1.0          0.0        0.0       0.0         1.0
Pirates       1.0          1.0        1.0       0.0         0.0
Spectre       1.0          1.0        0.0       0.0         1.0
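
A shorter route that should give the same table is Series.str.get_dummies, which splits on a separator and encodes in one step. This is an alternative to the stack/groupby approach above, not a restatement of it; the g_ prefix is added with add_prefix to match, and the variable names dummies and encoded are just for illustration. Joining back onto the original frame is my assumption about what you'd want next.

```python
import pandas as pd

data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Adventure|Fantasy',
                                'Action|Adventure|Thriller',
                                'Action|Thriller']})

# Split each string on '|' and build one indicator column per distinct
# genre, aligned with the original rows.
dummies = data['genres'].str.get_dummies(sep='|').add_prefix('g_')

# Attach the indicator columns to the original frame and drop the raw
# string column, leaving one row per title.
encoded = data.drop(columns='genres').join(dummies)

print(encoded)
```

Because the result keeps the original row index, it can be joined directly without any grouping, and the rows stay in their original order rather than being sorted by title.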
philh