
How to apply a multiple argument function on a large dataframe with pandas?

I'm facing a challenge joining two dataframes into one.

For example, I have 2 dataframes:

  1. The first df (df1) is the base, with more than 300K lines and multiple dates for every ID.

e.g.:

ID | date |
001| 01-01-2021|
001| 02-01-2021|
001| 03-01-2021|
001| 04-01-2021|
001| 05-01-2021|

...

002| 01-01-2021|
002| 02-01-2021|
002| 03-01-2021|

  2. The second df (df2) holds some extra info that I need to join onto the first dataframe (it is smaller than the first one, about 40K lines).

e.g.:

ID | start_date | end_date | status |
001| 01-01-2021 | 02-01-2021| working |
001| 02-02-2021 | 01-03-2021| not working|

The challenge is to identify the status of every row in df1 based on the ID and the start and end dates in df2.

Basically, if ID_df1 == ID_df2 and date_df1 >= start_date and date_df1 <= end_date, I have to capture the status from df2, which in this case is "working".
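To make the rule concrete, here is a minimal sketch that checks a single (ID, date) pair against the df2 sample rows above. The day-first date format is an assumption inferred from the sample values, and the names `row_id`, `row_date` and `matched_status` are made up for illustration.

```python
import pandas as pd

# The two df2 sample rows from the question.
df2 = pd.DataFrame({
    "ID": ["001", "001"],
    "start_date": ["01-01-2021", "02-02-2021"],
    "end_date": ["02-01-2021", "01-03-2021"],
    "status": ["working", "not working"],
})

# Assumption: the dates are day-month-year strings.
for col in ["start_date", "end_date"]:
    df2[col] = pd.to_datetime(df2[col], format="%d-%m-%Y")

# One hypothetical df1 row: ID 001 on 2 January 2021.
row_id = "001"
row_date = pd.Timestamp("2021-01-02")

# Same ID, and the date falls inside [start_date, end_date].
match = (
    (df2["ID"] == row_id)
    & (df2["start_date"] <= row_date)
    & (df2["end_date"] >= row_date)
)
matched_status = df2.loc[match, "status"].tolist()
print(matched_status)
```

Only the first df2 row covers that date, so the list holds the single status "working".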

To solve this issue, I've created a function called status:

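import pandas as pd

def status(df2, id, date):
    # Scan df2 row by row and remember the first row that matches the ID and date range.
    pos = -1
    for i in range(len(df2)):
        if (df2.iat[i, 0] == id) and (pd.to_datetime(df2.iat[i, 1]) <= date) and (pd.to_datetime(df2.iat[i, 2]) >= date):
            pos = i
            break  # stop at the first match
    if pos > -1:
        return df2.iat[pos, 3]
    else:
        return "Not Found"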

My issue is how to apply the function "status" to every row of df1.

I've tried to create a blank column and apply the function in a loop:

df1['Status'] = ""

for i in range(len(df1)):
    df1.iat[i, 2] = status(df2, df1.iat[i, 0], df1.iat[i, 1])

It worked for me, but it took a lot of time: of the 300K rows, I only covered the first 4K in one hour.

Is there an easier way to do that?

Thank you very much!

I've also tried:

df1['Status'] = df1.apply(lambda x: status(df2, df1['ID'], df1['Date']),axis=1)

But it didn't work; I got this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
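The error most likely comes from the lambda passing the whole columns (`df1['ID']`, `df1['Date']`) instead of the values of the current row `x`, so the comparisons inside `status` produce a Series and the `if` cannot evaluate it. A minimal sketch of the row-wise call, assuming the date column is named `date` as in the sample table and that all dates are already parsed as day-first datetimes (the small frames below are made-up sample data):

```python
import pandas as pd

def status(df2, id_, date):
    # Linear scan over df2; returns the status of the first matching row.
    for i in range(len(df2)):
        if df2.iat[i, 0] == id_ and df2.iat[i, 1] <= date <= df2.iat[i, 2]:
            return df2.iat[i, 3]
    return "Not Found"

# Assumption: dates are day-month-year strings.
fmt = "%d-%m-%Y"
df1 = pd.DataFrame({
    "ID": ["001", "001", "002"],
    "date": pd.to_datetime(["01-01-2021", "03-01-2021", "01-01-2021"], format=fmt),
})
df2 = pd.DataFrame({
    "ID": ["001", "001"],
    "start_date": pd.to_datetime(["01-01-2021", "02-02-2021"], format=fmt),
    "end_date": pd.to_datetime(["02-01-2021", "01-03-2021"], format=fmt),
    "status": ["working", "not working"],
})

# x is a single row of df1, so x["ID"] and x["date"] are scalars.
df1["Status"] = df1.apply(lambda x: status(df2, x["ID"], x["date"]), axis=1)
print(df1)
```

This removes the ValueError, but it still calls `status` once per row and scans df2 each time, so it is unlikely to be much faster than the explicit loop on 300K rows.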

Not an elegant solution, but probably a much faster one:

# big dataframe with all ID combinations
df = pd.merge(df1, df2, how='left', on='ID')

# conditions to filter the correct combinations
c1 = df['date'] >= df['start_date']
c2 = df['date'] <= df['end_date']

# final dataframe
df = df.loc[c1 & c2, ['ID', 'date', 'status']].copy()
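Note that the filter above drops every df1 row that has no matching interval, and that the comparisons only behave as date comparisons if the columns hold datetimes rather than strings. The following is a rough sketch of one way to handle both points, not a definitive implementation: it assumes day-first date strings, uses made-up sample data, and introduces the names `merged`, `matched` and `result` purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["001", "001", "001", "002"],
    "date": ["01-01-2021", "03-01-2021", "15-02-2021", "01-01-2021"],
})
df2 = pd.DataFrame({
    "ID": ["001", "001"],
    "start_date": ["01-01-2021", "02-02-2021"],
    "end_date": ["02-01-2021", "01-03-2021"],
    "status": ["working", "not working"],
})

# Assumption: all dates are day-month-year strings.
df1["date"] = pd.to_datetime(df1["date"], format="%d-%m-%Y")
for col in ["start_date", "end_date"]:
    df2[col] = pd.to_datetime(df2[col], format="%d-%m-%Y")

# Every df1 row paired with every df2 interval that shares its ID.
merged = df1.merge(df2, on="ID", how="left")

# Keep only the pairs where the date falls inside the interval (inclusive).
in_range = merged["date"].between(merged["start_date"], merged["end_date"])
matched = merged.loc[in_range, ["ID", "date", "status"]]

# If intervals overlap, keep the first match, like the original function.
matched = matched.drop_duplicates(subset=["ID", "date"])

# Join back onto df1 so rows without a match are kept.
result = df1.merge(matched, on=["ID", "date"], how="left")
result["status"] = result["status"].fillna("Not Found")
print(result)
```

The intermediate `merged` frame has one row per df1 row per interval of the same ID, so memory use grows with the number of intervals per ID. The sketch also assumes each (ID, date) pair appears only once in df1; if it can repeat, the final join returns one row per occurrence.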

