How to apply a multiple argument function on a large dataframe with pandas?
I'm facing a challenge joining two dataframes into one.
For example, I have 2 dataframes:
df1, e.g.:
ID | date      |
001| 01-01-2021|
001| 02-01-2021|
001| 03-01-2021|
001| 04-01-2021|
001| 05-01-2021|
...
002| 01-01-2021|
002| 02-01-2021|
002| 03-01-2021|
df2, e.g.:
ID | start_date | end_date  | status     |
001| 01-01-2021 | 02-01-2021| working    |
001| 02-02-2021 | 01-03-2021| not working|
The challenge is to identify the status of every row in df1 based on the ID and the start and end dates in df2.
Basically, if ID_df1 == ID_df2 and date_df1 >= start_date and date_df1 <= end_date, I have to capture the status from df2, which in this case is "working".
To solve this, I created a function called status:
def status(df2, id, date):
    # Scan df2 for the first row whose ID matches and whose date range contains `date`
    pos = -1
    for i in range(len(df2)):
        if (df2.iat[i, 0] == id) and (pd.to_datetime(df2.iat[i, 1]) <= date) and (pd.to_datetime(df2.iat[i, 2]) >= date):
            pos = i
            break
    if pos > -1:
        return df2.iat[pos, 3]
    else:
        return "Not Found"
My issue is applying the function status to every row of df1.
I tried creating a blank column and applying the function in a loop:
df1['Status'] = ""
for i in range(len(df1)):
    df1.iat[i, 2] = status(df2, df1.iat[i, 0], df1.iat[i, 1])
It worked, but it took a lot of time: of the 300K rows, it covered only the first 4K in one hour.
Is there an easier way to do that?
Thank you very much!
I also tried:
df1['Status'] = df1.apply(lambda x: status(df2, df1['ID'], df1['Date']),axis=1)
But it did not work; I got the error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
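The error most likely comes from the lambda passing the whole columns df1['ID'] and df1['Date'] to status instead of the current row's values, so the comparison inside the `if` produces a Series rather than a single boolean. A minimal sketch of the row-wise call, assuming the columns are named ID and date as in the tables above and that the dates are day-first (the sample data and the simplified status function here are illustrative, not the original data):

```python
import pandas as pd

# Illustrative sample data; dates assumed to be in day-month-year order
df1 = pd.DataFrame({
    "ID": ["001", "001", "002"],
    "date": pd.to_datetime(["01-01-2021", "05-01-2021", "01-01-2021"], format="%d-%m-%Y"),
})
df2 = pd.DataFrame({
    "ID": ["001", "001"],
    "start_date": pd.to_datetime(["01-01-2021", "02-02-2021"], format="%d-%m-%Y"),
    "end_date": pd.to_datetime(["02-01-2021", "01-03-2021"], format="%d-%m-%Y"),
    "status": ["working", "not working"],
})

def status(df2, id, date):
    # Return the status of the first df2 row matching the ID and containing the date
    for i in range(len(df2)):
        if df2.iat[i, 0] == id and df2.iat[i, 1] <= date <= df2.iat[i, 2]:
            return df2.iat[i, 3]
    return "Not Found"

# Use the row `x` passed to the lambda, not the full df1 columns
df1["Status"] = df1.apply(lambda x: status(df2, x["ID"], x["date"]), axis=1)
statuses = df1["Status"].tolist()
```

This removes the ValueError, but it still scans df2 once per row of df1, so it is unlikely to be much faster than the explicit loop on 300K rows.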
Not an elegant solution, but probably a much faster one:
# big dataframe with all ID combinations
df = pd.merge(df1, df2, how='left', on='ID')
# conditions to filter the correct combinations
c1 = df['date'] >= df['start_date']
c2 = df['date'] <= df['end_date']
# final dataframe
df = df.loc[c1 & c2, ['ID', 'date', 'status']].copy()
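Two things may be worth checking with the snippet above: the comparisons only behave as date comparisons if the columns are real datetimes rather than strings, and the final filter drops every df1 row that has no matching range. A self-contained sketch that handles both, assuming day-first date strings, non-overlapping ranges per ID in df2, and "Not Found" as the label for unmatched rows (as in the question's function); the sample data is illustrative:

```python
import pandas as pd

# Illustrative sample data, stored as strings as in the question's tables
df1 = pd.DataFrame({
    "ID": ["001", "001", "001", "002"],
    "date": ["01-01-2021", "05-01-2021", "10-02-2021", "01-01-2021"],
})
df2 = pd.DataFrame({
    "ID": ["001", "001"],
    "start_date": ["01-01-2021", "02-02-2021"],
    "end_date": ["02-01-2021", "01-03-2021"],
    "status": ["working", "not working"],
})

# Parse the date strings (day-month-year order is an assumption)
df1["date"] = pd.to_datetime(df1["date"], format="%d-%m-%Y")
for col in ("start_date", "end_date"):
    df2[col] = pd.to_datetime(df2[col], format="%d-%m-%Y")

# Pair every df1 row with every df2 range for the same ID, then keep valid pairs
merged = pd.merge(df1, df2, how="left", on="ID")
in_range = (merged["date"] >= merged["start_date"]) & (merged["date"] <= merged["end_date"])
matched = merged.loc[in_range, ["ID", "date", "status"]].drop_duplicates()

# Re-attach to df1 so that rows without a matching range are kept
result = df1.merge(matched, how="left", on=["ID", "date"])
result["status"] = result["status"].fillna("Not Found")
```

The intermediate merge grows with the number of df2 ranges per ID, so memory use is worth watching if some IDs have many ranges.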