I have a large dataset that looks similar to this in terms of content:
import numpy as np
import pandas as pd
test = pd.DataFrame({'date': ['2018-08-01', '2018-08-01', '2018-08-02', '2018-08-03', '2019-09-01', '2019-09-02', '2019-09-03', '2020-01-02', '2020-01-03', '2020-01-04', '2020-10-04', '2020-10-05'],
                     'account': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e']})
For each account, I am trying to create a column that marks rows with that account's earliest date as "Yes" (even if the earliest date repeats) and all other rows as "No". I am using the following code, which works nicely on a smaller subset of this data but is too slow on my entire (larger) dataset.
first_date = test.groupby('account').agg({'date':np.min})
test['first_date'] = 'No'
for row in first_date.iterrows():
    account = row[0]
    date = row[1].date
    mask = (test.account == account) & (test.date == date)
    test.loc[mask, 'first_date'] = 'Yes'
Any ideas for improvement? I'm fairly new to Python and am already running into runtime issues on larger datasets with pandas DataFrames. Thanks in advance.
Generally, when we use pandas or NumPy, we want to avoid iterating over our data and use the provided vectorized methods instead.
Use groupby.transform to get the min date on each row, then use np.where to create your conditional column:
m = test['date'] == test.groupby('account')['date'].transform('min')
test['first_date'] = np.where(m, 'Yes', 'No')
          date account first_date
0   2018-08-01       a        Yes
1   2018-08-01       a        Yes
2   2018-08-02       a         No
3   2018-08-03       a         No
4   2019-09-01       b        Yes
5   2019-09-02       b         No
6   2019-09-03       b         No
7   2020-01-02       c        Yes
8   2020-01-03       c         No
9   2020-01-04       c         No
10  2020-10-04       d        Yes
11  2020-10-05       e        Yes
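For reference, here is the same approach as a self-contained sketch you can run end to end. One step is my own assumption rather than part of the original data: the dates are parsed with pd.to_datetime first, so the comparison is chronological. With ISO-formatted strings like the ones in the question, a plain string comparison happens to give the same answer, but that may not hold for other date formats. The variable name earliest is just illustrative.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-01', '2018-08-02', '2018-08-03',
             '2019-09-01', '2019-09-02', '2019-09-03',
             '2020-01-02', '2020-01-03', '2020-01-04',
             '2020-10-04', '2020-10-05'],
    'account': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'e'],
})

# Assumption: parse the strings into datetimes so 'min' is chronological,
# not lexicographic.
test['date'] = pd.to_datetime(test['date'])

# transform('min') returns one value per original row (the minimum of that
# row's group), aligned with test's index, so no loop or merge is needed.
earliest = test.groupby('account')['date'].transform('min')

# Row-wise comparison against the per-account minimum, then map to labels.
test['first_date'] = np.where(test['date'] == earliest, 'Yes', 'No')
```

The key point is that transform keeps the original shape, whereas agg collapses each group to a single row. That is what lets the comparison run once over the whole column instead of once per account.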