How to compare a dataframe with another df and return new rows from the first df using pandas
I am trying to find the values in a test dataframe ( df2 ) that do not appear in another dataframe ( df1 ). The two dataframes are as follows:
df1 is created from the following source:
field1 | field2 |
---|---|
AG | Agree |
SA | Somewhat Agree |
DG | Disagree |
SD | Somewhat Disagree |
NO | None |
df2 is created from the following source:
field1 | field2 |
---|---|
CA | California |
TX | Texas |
NO | None |
NY | New York |
Using Method 1 (shown below), I get the expected result.
Method 1
diff_df = df2[~(df2['field1'].isin(df1['field1']) & df2['field2'].isin(df1['field2']))].reset_index(drop=True)
This gives me the following expected result:
field1 field2
0 CA California
1 TX Texas
2 NY New York
Note: The duplicate value in df2 ( NO: None ) gets dropped, too.
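One caveat about Method 1 is worth illustrating: it tests each column independently, so it is not a true row-by-row comparison. The sketch below uses a small hypothetical df2 (not the data from the question) whose only row pairs AG with None. Both values occur somewhere in df1, but never together in the same row.

```python
import pandas as pd

df1 = pd.DataFrame({'field1': ['AG', 'NO'],
                    'field2': ['Agree', 'None']})
# Hypothetical row: 'AG' and 'None' each exist in df1, but not as a pair.
df2 = pd.DataFrame({'field1': ['AG'],
                    'field2': ['None']})

# Method 1 style: each column is checked on its own.
per_column = df2[~(df2['field1'].isin(df1['field1'])
                   & df2['field2'].isin(df1['field2']))]

# Row-wise check: compare (field1, field2) pairs as a unit.
pairs_in_df1 = set(zip(df1['field1'], df1['field2']))
is_new_row = [pair not in pairs_in_df1
              for pair in zip(df2['field1'], df2['field2'])]
row_wise = df2[is_new_row]

print(len(per_column))  # 0: the row is treated as already present
print(len(row_wise))    # 1: the pair (AG, None) is actually new
```

With the data in the question the two approaches agree, so this only matters when values can be recombined across rows.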
However, I am facing one problem: in some situations a different set of fields needs to be compared (e.g. there may be a third field, field3, in the equation).
The number of fields varies greatly from case to case, and the user has no control over it.
My problem: How do I modify my query so that comparing the two dataframes still gives the expected result?
In the situation described, what would be a possible approach?
Try:
import pandas as pd
# Creating the dataframe
df1_field1 = ['a', 'b', 'c', 'd']
df1_field2 = ['a', 'b', 'c', 'd']
df1_field3 = ['a', 'b', 'c', 'd']
df1 = pd.DataFrame(
{'field1':df1_field1,
'field2':df1_field2,
'field3':df1_field3,
})
df2_field1 = ['a', 'c', 'd', 'e']
df2_field2 = ['a', 'c', 'd', 'e']
df2_field3 = ['a', 'c', 'd', 'e']
df2 = pd.DataFrame(
{'field1':df2_field1,
'field2':df2_field2,
'field3':df2_field3,
})
print(df1)
print(df2)
df_all = df2.merge(df1.drop_duplicates(), on=['field1','field2','field3'],
how='outer', indicator=True)
df_all[df_all['_merge'] == 'left_only']
It yields:
field1 field2 field3
0 a a a
1 b b b
2 c c c
3 d d d
field1 field2 field3
0 a a a
1 c c c
2 d d d
3 e e e
field1 field2 field3 _merge
3 e e e left_only
As you can see, it works, and it is just an adaptation of another answer on the page you linked in the description.
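Since the question says the set of fields changes from case to case, the column list can be derived instead of hard-coded. The following is a minimal sketch, not a definitive implementation: the helper name rows_only_in_left and its cols parameter are made up for illustration, and it assumes the comparison should cover every column the two dataframes share. It uses how='left' rather than 'outer', which keeps only the rows of the first dataframe and preserves their order.

```python
import pandas as pd

def rows_only_in_left(left, right, cols=None):
    """Return rows of `left` whose values in `cols` match no row of `right`.

    Hypothetical helper: if `cols` is None, every column shared by both
    dataframes is compared.
    """
    if cols is None:
        cols = [c for c in left.columns if c in right.columns]
    # Left merge keeps every row of `left`; `_merge` records whether a match was found.
    marked = left.merge(right[cols].drop_duplicates(),
                        on=cols, how='left', indicator=True)
    only_left = marked.loc[marked['_merge'] == 'left_only', left.columns]
    return only_left.reset_index(drop=True)

# Data from the question
df1 = pd.DataFrame({'field1': ['AG', 'SA', 'DG', 'SD', 'NO'],
                    'field2': ['Agree', 'Somewhat Agree', 'Disagree',
                               'Somewhat Disagree', 'None']})
df2 = pd.DataFrame({'field1': ['CA', 'TX', 'NO', 'NY'],
                    'field2': ['California', 'Texas', 'None', 'New York']})

diff_df = rows_only_in_left(df2, df1)
print(diff_df)
```

On the question's data this returns the CA, TX and NY rows. Passing an explicit list, for example cols=['field1', 'field2', 'field3'], restricts the comparison to those fields.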