
I have a groupby in a Jupyter notebook that takes ages to run, and after about 10 minutes of running it says 'kernel died...'.

The groupby looks like this:

df1.groupby(['date', 'unit', 'company', 'city'])['col1',
'col2',
'col3',
'col4',
  ...
'col20'].mean()

All of the 'col' columns are float values. I am running everything locally. Any ideas?

UPDATE:

The shape of df1 is:

(1360, 24)

Memory and dtypes:

dtypes: category(3), datetime64[ns](2), float64(17), int64(2)
memory usage: 266.9 KB

The unique size of city, date, company, unit:

len(df1.date.unique()) = 789
len(df1.unit.unique()) = 76
len(df1.company.unique()) = 205
len(df1.city.unique()) = 237

I have 16 GB of memory on my MacBook Pro.

UPDATE 2:

It works only if date and unit are the only two columns inside the groupby. If I add either company or city, it no longer works and keeps running indefinitely.

n1k31t4
BlueIvy

3 Answers


I thought it might be because of the different types being used in the columns, but I created the example below, which works fine over mixed column types. The only real difference is the size, which is why I think you are probably running out of memory.

Working example

I use int, str and datetime objects:

In [1]: import pandas as pd                                                                                                                                                                                                                          

In [2]: import datetime                                                                                                                                                                                                                              

In [3]: df = pd.DataFrame({'Branch': 'A A A A A A A B'.split(),
                           'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                           'Quantity': [1, 3, 5, 1, 8, 1, 9, 3],
                           'Date':[datetime.datetime(2013, 1, 1, 13, 0),
                                   datetime.datetime(2013, 1, 1, 13, 5),
                                   datetime.datetime(2013, 10, 1, 20, 0),
                                   datetime.datetime(2013, 10, 2, 10, 0), 
                                   datetime.datetime(2013, 10, 1, 20, 0),
                                   datetime.datetime(2013, 10, 2, 10, 0),
                                   datetime.datetime(2013, 12, 2, 12, 0),
                                   datetime.datetime(2013, 12, 2, 14, 0)]})                                                                                                                                                                                                                                          

In [4]: df                                                                                                                                                                                                                                           
Out[4]: 
  Branch Buyer  Quantity                Date
0      A  Carl         1 2013-01-01 13:00:00
1      A  Mark         3 2013-01-01 13:05:00
2      A  Carl         5 2013-10-01 20:00:00
3      A  Carl         1 2013-10-02 10:00:00
4      A   Joe         8 2013-10-01 20:00:00
5      A   Joe         1 2013-10-02 10:00:00
6      A   Joe         9 2013-12-02 12:00:00
7      B  Carl         3 2013-12-02 14:00:00


In [5]: df.shape                                                                                                                                                                                                                                    
Out[5]: (8, 4)

Now I repeat the dataframe (the append step is not shown here) and add one hour to the datetime values of one copy, just to increase the number of groupby combinations to expect:

In [14]: df.iloc[0:8, 3] += datetime.timedelta(hours=1)                                                                                                                                                                                              

Now perform a groupby over all columns, and sum only on Quantity (it is my only numeric column).

The results are as expected:

In [16]: df.groupby(["Branch", "Buyer", "Quantity", "Date"])["Quantity"].sum()                                                                                                                                                                       
Out[16]: 
Branch  Buyer  Quantity  Date               
A       Carl   1         2013-01-01 13:00:00    1
                         2013-01-01 14:00:00    1
                         2013-10-02 10:00:00    1
                         2013-10-02 11:00:00    1
               5         2013-10-01 20:00:00    5
                         2013-10-01 21:00:00    5
        Joe    1         2013-10-02 10:00:00    1
                         2013-10-02 11:00:00    1
               8         2013-10-01 20:00:00    8
                         2013-10-01 21:00:00    8
               9         2013-12-02 12:00:00    9
                         2013-12-02 13:00:00    9
        Mark   3         2013-01-01 13:05:00    3
                         2013-01-01 14:05:00    3
B       Carl   3         2013-12-02 14:00:00    3
                         2013-12-02 15:00:00    3
Name: Quantity, dtype: int64

Break your problem down

It might be difficult to break your problem down, because you need the whole data for the groupby operation. You could, however, save each of the groups to disk, perform the mean() computation on them separately and merge the results yourself. The name of each group is the combination of the groupby column values, which can be used to build the index of the resulting dataframe.

It could look something like this:

for name, group in df1.groupby(['date', 'unit', 'company', 'city']):
    print("Processing groupby combination:", name)        # name is the tuple of group keys
    result = group.mean(numeric_only=True)                 # mean over the numeric columns only
    _df = pd.DataFrame(index=[name], data=[result])
    _df.to_csv("path/somewhere/" + "_".join(map(str, name)) + ".csv")

You will then have a folder full of results, one file per group, which you just have to read back in and combine, as sketched below.
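
Reading the results back in could look something like this (a minimal sketch; it assumes the same placeholder folder as above and that each file holds one row indexed by its group key):

import glob
import pandas as pd

# Read every per-group CSV back in (the first column holds the group key)
parts = [pd.read_csv(path, index_col=0) for path in glob.glob("path/somewhere/*.csv")]

# Stack the one-row frames into a single result dataframe
combined = pd.concat(parts)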

Other methods

It is known that Pandas does not handle many operations on huge datasets very efficiently (compared to e.g. the data.table package). There is the Dask package, which essentially does Pandas things in a distributed manner, but that might be overkill (and you'll of course need more resources!).
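
Just as a rough illustration of what that could look like (assuming df1 and the column names from the question, and an arbitrary number of partitions):

import dask.dataframe as dd

# Split the in-memory dataframe into partitions that Dask processes lazily
ddf = dd.from_pandas(df1, npartitions=4)

# Same groupby/mean as before; .compute() triggers the actual computation
result = ddf.groupby(['date', 'unit', 'company', 'city'])[['col1', 'col2', 'col3']].mean().compute()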

n1k31t4

I bet that company and unit are category type?

I couldn't explain the underlying reason, but groupby doesn't like the category type.

Change those columns' type to 'object' and it will run in a couple of milliseconds without consuming any noticeable memory.
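
A minimal sketch of that workaround, assuming df1 and the column names from the question (exactly which columns are of category dtype is an assumption):

# Cast the categorical grouping keys to plain object dtype before grouping
# (assuming 'unit', 'company' and 'city' are the category columns)
for col in ['unit', 'company', 'city']:
    if str(df1[col].dtype) == 'category':
        df1[col] = df1[col].astype('object')

result = df1.groupby(['date', 'unit', 'company', 'city'])[['col1', 'col2', 'col3']].mean()

More recent pandas versions also accept observed=True in groupby, which keeps the category dtype but only returns the key combinations that actually occur in the data; with the older default (observed=False), categorical keys produce every possible combination of categories, which is likely what blows up here.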

Begoodpy

len(df1.date.unique()) = 789
len(df1.unit.unique()) = 76
len(df1.company.unique()) = 205
len(df1.city.unique()) = 237

That gives 2,913,350,940 possible combinations, yet you say your dataframe has only 1360 rows.

Try creating a column built by concatenating the four columns, then grouping by that, along the lines of the sketch below.
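
A rough sketch of that idea, assuming df1 and the column names from the question ('combo' is just a placeholder name):

# Build a single composite key from the four grouping columns
df1['combo'] = (df1['date'].astype(str) + '_'
                + df1['unit'].astype(str) + '_'
                + df1['company'].astype(str) + '_'
                + df1['city'].astype(str))

# Group on the composite key instead of the four separate columns
result = df1.groupby('combo')[['col1', 'col2', 'col3']].mean()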

Acccumulation