10

I have two data frames df1 and df2 which look something like this.

    cat1    cat2  cat3
0   10       25     12  
1   11       22     14
2   12       30     15

   all_cats  cat_codes
0   10       A     
1   11       B 
2   12       C
3   25       D
4   22       E
5   30       F
6   14       G

I would like a DataFrame where each column in df1 is created but replaced with cat_codes. Column header names are different. I have tried join and merge but my number of rows are inconsistent. I am dealing with huge number of samples (100,000). My output should ideally be this:

    cat1    cat2  cat3
0    A        D     C  
1    B        E     Y
2    C        F     Z

The resulting columns should be appended to df1.

Danny
  • 1,166
  • 1
  • 8
  • 16

2 Answers2

8

You can convert df2 to a dictionary and use that to replace the values in df1

cat_1 = [10, 11, 12]
cat_2 = [25, 22, 30]
cat_3 = [12, 14, 15]

df1 = pd.DataFrame({'cat1':cat_1, 'cat2':cat_2, 'cat3':cat_3})

all_cats = [10, 11, 12, 25, 22, 30, 15]
cat_codes = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

df2 = pd.DataFrame({'all_cats':all_cats, 'cat_codes':cat_codes})

rename_dict = df2.set_index('all_cats').to_dict()['cat_codes']

df1 = df1.replace(rename_dict)

If you still have some values that aren't in your dictionary and want to replace them with Z, you can use a regex to replace them.

df1.astype('str').replace({'\d+': 'Z'}, regex=True)
6
df3 = pd.merge(df1,df2,left_on=['cat'+str(i)], right_on = ['cat_codes'], how = 'left')

I would iterate this for cat1,cat2 and cat3. This does not replace the existing column values but appends new columns.

Danny
  • 1,166
  • 1
  • 8
  • 16