
Optimizing Pandas groupby/apply

I am writing a process which takes a semi-large file as input (~4 million rows, 5 columns) and performs a few operations on it.
Columns:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2

I need to create a file containing one copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I got this working, but my solution is very slow (3 hours and counting as of right now).
Here is my code:

import pandas as pd

file = 'input.csv'
data = pd.read_csv(file)  # 'input' shadows the Python builtin, so renamed here

data = data.drop_duplicates()

# keep only the rows with STATUS == 1
# (groupby().filter() with a Series-returning lambda raises a TypeError,
# since filter expects a scalar boolean per group; a plain mask does the job)
card_groups = data[data['STATUS'] == 1]


def important(x):
    # pick the row(s) carrying the latest CREATED_DATE within the group
    latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
    return x[x.CREATED_DATE == latest_date]

# where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)

path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for a 154k-row file
# 3+ hours for ~4M rows

I was wondering if you had any advice on how to improve the running time of this little process.
Thank you and have a good day.

Setup (FYI: make sure you use parse_dates when reading your csv so the date column comes in as datetime64 rather than object; note that parse_dates=True only parses the index, so pass the column name explicitly. The snippets below assume import numpy as np and from pandas import DataFrame, date_range.)
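
For the question's file that could look like the following (a sketch; 'input.csv' and CREATED_DATE come from the question):

df = pd.read_csv('input.csv', parse_dates=['CREATED_DATE'])  # date column arrives as datetime64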

In [6]: n_groups = 10000

In [7]: N = 4000000

In [8]: dates = date_range('20130101',periods=100)

In [10]: pd.set_option('max_rows',10)

In [13]: df = DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))

In [14]: df
Out[14]: 
         card_no       date  status
0           5790 2013-02-11       6
1           6572 2013-03-17       6
2           7764 2013-02-06       3
3           4905 2013-04-01       3
4           3871 2013-04-08       1
...          ...        ...     ...
3999995     1891 2013-02-16       5
3999996     9048 2013-01-11       9
3999997     1443 2013-02-23       1
3999998     2845 2013-01-28       0
3999999     5645 2013-02-05       8

[4000000 rows x 3 columns]

In [15]: df.dtypes
Out[15]: 
card_no             int64
date       datetime64[ns]
status              int64
dtype: object

Keep only status == 1, group by card_no, then take the max date for each group:

In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]: 
card_no
0         2013-04-06
1         2013-03-30
2         2013-04-09
...
9997      2013-04-07
9998      2013-04-07
9999      2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]

In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
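
If you need the full rows rather than just the dates (as the question does), idxmax gives the index label of the latest row per group, which .loc can fetch. A minimal sketch on the same frame (variable names here are illustrative):

# index label of the row carrying the max date within each card_no group
latest_idx = df[df.status == 1].groupby('card_no')['date'].idxmax()

# exactly one row per card_no; on tied dates the first occurrence wins,
# which matches the "1 copy of each CARD_NO" requirement
latest_rows = df.loc[latest_idx]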

If you need a transform of this (i.e. the per-group max repeated on every row of the group): note that with pandas < 0.14.1 (releasing this week) transform('max') will be pretty slow and you will need a workaround (a sketch follows the timing below); from 0.14.1 on it is fast.

In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]: 
4    2013-04-10
13   2013-04-10
25   2013-04-10
...
3999973   2013-04-10
3999979   2013-04-10
3999997   2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]

In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
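
A sketch of the kind of workaround meant above for older pandas versions: compute the per-group max once with a plain groupby, then map it back onto the filtered rows (names here are illustrative):

mask = df.status == 1
max_dates = df[mask].groupby('card_no')['date'].max()  # one value per card_no
broadcast = df.loc[mask, 'card_no'].map(max_dates)     # same shape as the transform result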

You probably want to merge the final transform back into the original frame (res below is the transform result from In [20]):

In [23]: res = df[df.status==1].groupby('card_no')['date'].transform('max')

In [24]: df.join(res.to_frame('max_date'))
Out[24]: 
         card_no       date  status   max_date
0           5790 2013-02-11       6        NaT
1           6572 2013-03-17       6        NaT
2           7764 2013-02-06       3        NaT
3           4905 2013-04-01       3        NaT
4           3871 2013-04-08       1 2013-04-10
...          ...        ...     ...        ...
3999995     1891 2013-02-16       5        NaT
3999996     9048 2013-01-11       9        NaT
3999997     1443 2013-02-23       1 2013-04-09
3999998     2845 2013-01-28       0        NaT
3999999     5645 2013-02-05       8        NaT

[4000000 rows x 4 columns]

In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
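
If you would rather key the merge on card_no instead of the row index, a sketch (note the different semantics: this fills max_date for every row of a card, not only its status == 1 rows as the index join above does):

max_by_card = df[df.status == 1].groupby('card_no')['date'].max()
out = df.merge(max_by_card.to_frame('max_date'),
               left_on='card_no', right_index=True, how='left')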

The csv writing will actually take a fair amount of time relative to this. I use HDF5 for things like this; it is MUCH faster.
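
A minimal sketch of the HDF5 route (requires the PyTables package; the file and key names are placeholders):

df.to_hdf('result.h5', key='result', mode='w')  # write
df2 = pd.read_hdf('result.h5', 'result')        # read back

And putting the whole answer back in terms of the question's columns, the entire job reduces to a few vectorized lines. A sketch, not the original poster's code; it assumes CREATED_DATE parses cleanly as a date and STATUS is numeric:

import pandas as pd

df = pd.read_csv('input.csv', parse_dates=['CREATED_DATE']).drop_duplicates()
active = df[df['STATUS'] == 1]                            # STATUS filter
idx = active.groupby('CARD_NO')['CREATED_DATE'].idxmax()  # latest row per card
active.loc[idx].to_csv('result.csv', index=False)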
