12

I have a pandas data frame (X11) like this: In actual I have 99 columns up to dx99

    dx1      dx2    dx3    dx4
0   25041   40391   5856    0
1   25041   40391   25081   5856
2   25041   40391   42822   0
3   25061   40391   0       0
4   25041   40391   0       5856
5   40391   25002   5856    3569

I want to create additional column(s) for cell values like 25041,40391,5856 etc. So there will be a column 25041 with value as 1 or 0 if 25041 occurs in that particular row in any dxs columns. I am using this code and it works when number of rows are less.

mat = X11.as_matrix(columns=None)
values, counts = np.unique(mat.astype(str), return_counts=True)

for x in values:
    X11[x] = X11.isin([x]).any(1).astype(int)

I am getting result like this:

dx1     dx2     dx3    dx4  0   25002   25041   25061   25081   3569    40391   42822   5856
25041   40391   5856    0   0   0       1       0       0       0          1        0       1
25041   40391   25081  5856 0   0       1       0       1       0            1      0       1
25041   40391   42822   0   0   0       1       0       0       0           1       1       0
25061   40391   0       0   0   0       0       1       0       0          1        0       0
25041   40391   0    5856   0   0       1       0       0       0          1        0       1
40391   25002 5856   3569   0   1       0       0       0       1          1        0       1

When number of rows are many thousands or in millions, it hangs and takes forever and I am not getting any result. Please see that cell values are not unique to column, instead repeating in multi columns. For ex, 40391 is occurring in dx1 as well as in dx2 and so on for 0 and 5856 etc. Any idea how to improve the logic mentioned above?

Sanoj
  • 251
  • 1
  • 2
  • 6

2 Answers2

7

There's a much more pythonic solution in pandas...

This takes less than a second on 10 Million rows on my laptop:

for x in X11.E.unique():
    X11[x]=(X11.E==x).astype(int)
X11

Here are the details laid out:

Simple small dataframe -

import numpy as np
import pandas as pd

X11 = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
X11['E'] = [25223, 112233,25223,14333,14333,112233]
X11

simple small dataframe

Binarization method -

for x in X11.E.unique():
    X11[x]=(X11.E==x).astype(int)
X11

enter image description here

Dataframe with 10 Million rows -

pd.set_option("display.max_rows",20)
X12 = pd.DataFrame(np.random.randn(10000000,4), columns=list('ABCD'))
foo = [25223, 112233,25223,14333,14333,112233]
bar=[]
import random
for x in range(10000000):
    bar.append(random.choice(foo))
X12['E'] = bar
X12

enter image description here

Timed binarization (aka one-hot encoding) on 10 million row dataframe -

import time
start = time.clock()

for x in X12.E.unique():
    X12[x]=(X12.E==x).astype(int)
elapsed = (time.clock() - start)

print "This is the time that this took in seconds: ",elapsed

X12

enter image description here

Hope this helps!

AN6U5
  • 6,878
  • 1
  • 26
  • 42
5

It looks like you want to create dummy variable from a pandas dataframe column. Fortunately, pandas has a special method for it: get_dummies(). Here is a code snippet that you can adapt for your need:

import pandas as pd
data = pd.read_clipboard(sep=',')

#get the names of the first 3 columns
colN = data.columns.values[:3]

#make a copy of the dataframe
data_transformed = data

#the get_dummies method is doing the job for you
for column_name in colN:
    dummies = pd.get_dummies(data_transformed[column_name], prefix='value', prefix_sep='_')
    col_names_dummies = dummies.columns.values

    #then you can append new columns to the dataframe
    for i,value in enumerate(col_names_dummies):
        data_transformed[value] = dummies.iloc[:,i]

Here is the output of data_transformed:

         dx1    dx2    dx3   dx4    dx5    dx6    dx7  value_25041  value_25061  0  25041  40391   5856     0  V4511  V5867  30000            1            0   
    1  25041  40391  25081  5856   5363   3572      0            1            0   
    2  25041  40391  42822     0   5856      0      0            1            0   
    3  25061  40391      0     0      0      0      0            0            1   
    4  25041  40391      0  5856  25081  V4511  25051            1            0   

      value_40391  value_0  value_5856  value_25081  value_42822  
    0            1        0           1            0            0  
    1            1        0           0            1            0  
    2            1        0           0            0            1  
    3            1        1           0            0            0  
    4            1        1           0            0            0  
michaelg
  • 296
  • 1
  • 5