
Improving speed of iterrows() query that is utilizing a mask

I have a large dataset that looks similar to this in terms of content:

import numpy as np
import pandas as pd

test = pd.DataFrame({'date':['2018-08-01','2018-08-01','2018-08-02','2018-08-03','2019-09-01','2019-09-02','2019-09-03','2020-01-02','2020-01-03','2020-01-04','2020-10-04','2020-10-05'],
                 'account':['a','a','a','a','b','b','b','c','c','c','d','e']})

For each account, I am trying to create a column that marks rows with that account's earliest date as "Yes" (even if that earliest date repeats) and all other rows as "No". The following code works fine on a small subset of the data, but it is too slow on my entire (larger) dataset.

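# Earliest date per account (one row per account, indexed by account)
first_date = test.groupby('account').agg({'date':'min'})

# Default every row to 'No', then flip the matching rows one account at a time
test['first_date'] = 'No'
for row in first_date.iterrows():
    account = row[0]       # index label: the account
    date = row[1].date     # that account's earliest date
    # Each iteration rescans the whole DataFrame to build this mask
    mask = (test.account == account) & (test.date == date)
    test.loc[mask, 'first_date'] = 'Yes'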

Any ideas for improvement? I'm fairly new to Python and am already running into runtime issues on larger datasets with pandas DataFrames. Thanks in advance.

Generally, when working with pandas or NumPy, we want to avoid iterating over rows and use the provided vectorized methods instead.

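Use groupby.transform to broadcast each account's minimum date onto every row, then use np.where to create your conditional column. Because the dates are ISO-formatted strings (YYYY-MM-DD), the string minimum is also the earliest date: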

m = test['date'] == test.groupby('account')['date'].transform('min')
test['first_date'] = np.where(m, 'Yes', 'No')


          date account first_date
0   2018-08-01       a        Yes
1   2018-08-01       a        Yes
2   2018-08-02       a         No
3   2018-08-03       a         No
4   2019-09-01       b        Yes
5   2019-09-02       b         No
6   2019-09-03       b         No
7   2020-01-02       c        Yes
8   2020-01-03       c         No
9   2020-01-04       c         No
10  2020-10-04       d        Yes
11  2020-10-05       e        Yes
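
As a rough cross-check, here is a self-contained sketch that runs the vectorized version on the sample data and compares it with a loop in the spirit of the question's code. Converting the column with pd.to_datetime is an assumption on my part (the question keeps the dates as strings), and the names earliest, loop_result and same are only illustrative:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-01', '2018-08-02', '2018-08-03',
             '2019-09-01', '2019-09-02', '2019-09-03',
             '2020-01-02', '2020-01-03', '2020-01-04',
             '2020-10-04', '2020-10-05'],
    'account': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
})

# Assumption: parse to datetime64 so 'min' compares real dates, not strings.
test['date'] = pd.to_datetime(test['date'])

# Vectorized: broadcast each account's earliest date onto its rows, then compare.
earliest = test.groupby('account')['date'].transform('min')
test['first_date'] = np.where(test['date'] == earliest, 'Yes', 'No')

# Loop version (one full-frame mask per account), kept only for comparison.
loop_result = pd.Series('No', index=test.index)
for account, date in test.groupby('account')['date'].min().items():
    mask = (test['account'] == account) & (test['date'] == date)
    loop_result.loc[mask] = 'Yes'

# True when both approaches label every row identically.
same = bool((test['first_date'] == loop_result).all())
```

The loop builds a boolean mask over the whole frame once per account, so its cost grows with the number of accounts times the number of rows; the transform version does the grouping once, which is why it tends to scale much better.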


 