Compare multiple columns of two data frames using pandas
I have two data frames: df1 has Id and sendDate, and df2 has Id and actDate. The two data frames are not the same shape; df2 is a lookup table. There may be multiple instances of Id.
For example:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
I want to add a boolean True/False column to df1 that flags the rows where df1.Id == df2.Id and df1.sendDate == df2.actDate.
Expected output would add a column to df1:
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11", "2018-01-06", "2018-01-06", "2019-09-24"],
"Match?": [True, False, False, False, True]})
I'm new to Python, coming from R, so please let me know what other info you may need.
Use isin and boolean indexing:
import pandas as pd
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
"sendDate": ["2019-09-24", "2020-09-11",
"2018-01-06", "2018-01-06",
"2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
"actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})
df1['Match'] = (df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate']))
print(df1)
Output:
   Id    sendDate  Match
0   1  2019-09-24   True
1   1  2020-09-11   True
2   2  2018-01-06  False
3   3  2018-01-06  False
4   2  2019-09-24   True
The .isin() approaches test each column independently, so they also flag rows whose Id and date both occur somewhere in df2 but not together in the same row (e.g. Id=1 and date=2020-09-11 in your example). You can check both conditions together by doing a .merge() and checking where df2's date field is not null:
df1['match'] = df1.merge(df2, how='left', left_on=['Id', 'sendDate'], right_on=['Id', 'actDate'])['actDate'].notnull()
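One caveat with the one-liner above: if df2 ever contains the same Id/actDate pair more than once, the left merge returns more rows than df1 has, and the assignment also relies on df1 having the default 0..n-1 index. Below is a minimal sketch that guards against both, assuming it is acceptable to de-duplicate the lookup table first. The names `lookup` and `merged` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06",
                                 "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# Keep only the key columns, align the column names with df1, and drop
# repeated pairs so the left merge cannot multiply rows of df1.
lookup = (df2[["Id", "actDate"]]
          .rename(columns={"actDate": "sendDate"})
          .drop_duplicates())

# indicator=True adds a "_merge" column: "both" means the pair exists in df2.
merged = df1.merge(lookup, on=["Id", "sendDate"], how="left", indicator=True)

# to_numpy() assigns by position, so df1's index does not need to be 0..n-1.
df1["Match"] = (merged["_merge"] == "both").to_numpy()
print(df1)
```

For the sample data this marks only rows 0 and 4 as True, which agrees with the expected output in the question.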
A vectorized approach via numpy:
import numpy as np
df1['Match'] = np.where((df1['Id'].isin(df2['Id'])) & (df1['sendDate'].isin(df2['actDate'])),True,False)
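Note that this expression still tests Id and sendDate independently, so it returns True for Id=1 with 2020-09-11, just like the plain isin version. If the goal is to match the two columns as a pair, one possible sketch is to compare (Id, date) tuples against a set built from df2. The name `pairs` is illustrative, and this is a plain Python membership test rather than a numpy operation.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 2],
                    "sendDate": ["2019-09-24", "2020-09-11", "2018-01-06",
                                 "2018-01-06", "2019-09-24"]})
df2 = pd.DataFrame({"Id": [1, 2, 2],
                    "actDate": ["2019-09-24", "2019-09-24", "2020-09-11"]})

# Every (Id, actDate) combination that actually occurs in the lookup table.
pairs = set(zip(df2["Id"], df2["actDate"]))

# A row matches only if its own (Id, sendDate) combination is in that set.
df1["Match"] = [pair in pairs for pair in zip(df1["Id"], df1["sendDate"])]
print(df1)
```

Because the set holds whole pairs, duplicates in df2 are harmless and df1's index is irrelevant.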