Compare multiple columns of two data frames using pandas

I have two data frames: df1 has Id and sendDate, and df2 has Id and actDate. The two data frames are not the same shape; df2 is a lookup table. An Id may appear multiple times.

Example:

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                     "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})

df2 = pd.DataFrame({"Id": [1, 2, 2],
                     "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

I want to add a boolean True/False column to df1 that flags the rows where df1.Id == df2.Id and df1.sendDate == df2.actDate.

The expected output adds a column to df1:

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
                    "Match?": [True, False, False, False, True]})

I'm new to Python, coming from R, so please let me know what other info you may need.

Use isin and boolean indexing:

import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11",
                                 "2018-01-06", "2018-01-06",
                                 "2019-09-24"]})

df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)

Output:

   Id    sendDate  Match
0   1  2019-09-24   True
1   1  2020-09-11   True
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True

The .isin() approaches will also flag rows where the Id and the date each appear somewhere in df2 but not together in the same row (e.g. Id=1 and date=2020-09-11 in your example). You can check both at once by doing a .merge() and testing where df2's date field is not null:

df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
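One caveat with the one-liner above: if df2 ever contains the same (Id, actDate) pair twice, the left merge returns more rows than df1 has, and the assignment no longer lines up. Here is a minimal sketch of a more defensive variant, assuming you are free to deduplicate the lookup table first. The names `lookup` and `merged` are just illustrative, and `indicator=True` is used instead of the null check so the result does not depend on actDate itself being non-null.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06",
                                 "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# Keep one row per (Id, actDate) pair so the left merge cannot add rows to df1.
lookup = df2[["Id", "actDate"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" where the pair exists in
# the lookup table, "left_only" where it does not.
merged = df1.merge(lookup, how="left",
                   left_on=["Id", "sendDate"], right_on=["Id", "actDate"],
                   indicator=True)

# to_numpy() assigns by position, so a non-default index on df1 is not a problem.
df1["Match"] = (merged["_merge"] == "both").to_numpy()
print(df1)
```

On the sample data this should flag only the first and last rows, which is what the question asks for.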

A vectorized approach via numpy:

import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])),True,False)

You can use .isin():

df1['id_bool'] = df1.Id.isin(df2.Id)
df1['date_bool'] = df1.sendDate.isin(df2.actDate)

See the pandas documentation for Series.isin.
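Keep in mind that id_bool and date_bool are computed independently, so neither one (nor both combined with &) tells you whether the Id and the date occur together in the same row of df2. If you want to stay with isin, one option is to test the whole (Id, date) pair at once. This is only a sketch, assuming the pair is what you want to match on; `keys1` and `keys2` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06",
                                 "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# Build one composite key per row from the (Id, date) columns of each frame.
keys1 = pd.MultiIndex.from_frame(df1[["Id", "sendDate"]])
keys2 = pd.MultiIndex.from_frame(df2[["Id", "actDate"]])

# isin on a MultiIndex compares whole tuples, so a row matches only when
# both its Id and its date appear together in df2.
df1["Match"] = keys1.isin(keys2)
print(df1)
```

Because the comparison is on tuples, the row with Id=1 and date 2020-09-11 is not flagged, even though both values appear separately in df2.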


 