
How do I systematically compare all rows in two Pandas dataframes using specific columns and return the differences?

I have two large dataframes from different sources, largely of the same structure but of different lengths and in a different order. Most of the data is comparable but not all. The rows represent individuals and the columns contain data about those individuals. I want to check, row by row, certain column values of one dataframe against the 'master' dataframe and then return the omissions so these can be added to it.

I have been using the df.query() method to check individual cases using my own inputs because I can search the master dataframe using multiple columns - so, something like df.query('surname == "JONES" and initials == "DV" and town == "LONDON"'). I want to do something like this, but by creating a query for each row of the other dataframe using data from specific columns.

I think I can work out how I might do this using for loops and if statements, but that's going to be wildly inefficient and obviously not ideal. A list comprehension might be more efficient, but I can't work out the dataframe comparison part unless I create a new column whose data is built from the values I want to compare (JONES-DV-LONDON, but that seems wrong).

I think there is an answer here, but it relies on the dataframes being more or less identical (which mine aren't - hence my wish to compare only certain columns).

I have been unable to find an example of someone doing the same, which might be my failure again. I am a novice and I have a feeling I might be thinking about this in completely the wrong way. I would very much value any thoughts and pointers...

EDIT - some sample data (not exactly what I'm using, but hopefully it helps show what I am trying to do)

df1 (my master list)
surname    initials    town
JONES      D V         LONDON
DAVIES     H G         BIRMINGHAM

df2
surname    initials    town
DAVIES     H G         BIRMINGHAM
HARRIS     P J         SOUTHAMPTON
JONES      D V         LONDON

I would like to identify the columns to use in the comparison (surname, initials, town here - assume there are more columns, with data that cannot be matched) and then return the rows that are unique to df2 - so in this case a dataframe containing:

surname    initials    town
HARRIS     P J         SOUTHAMPTON

Define the columns to join on:

cols = ['surname', 'initials', 'town']

Then you can use merge with the parameter indicator=True, which adds a _merge column showing where each row appears (left_only, right_only or both):

df_res = df1.merge(df2, how='outer', on=cols, indicator=True)

and exclude the rows that appear in both dataframes:

df_res = df_res[df_res['_merge'] != 'both']
print(df_res)

    surname initials    town        _merge
2   HARRIS  P J         SOUTHAMPTON right_only
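
For anyone who wants to run the steps above end to end, here is a minimal sketch that rebuilds the sample data from the question. The DataFrame construction is my own reconstruction of the sample tables; the merge and filter are the same calls as above. The index label printed for the HARRIS row can differ between pandas versions, so do not rely on it.

```python
import pandas as pd

# Sample data reconstructed from the question (df1 is the master list).
df1 = pd.DataFrame({
    "surname": ["JONES", "DAVIES"],
    "initials": ["D V", "H G"],
    "town": ["LONDON", "BIRMINGHAM"],
})
df2 = pd.DataFrame({
    "surname": ["DAVIES", "HARRIS", "JONES"],
    "initials": ["H G", "P J", "D V"],
    "town": ["BIRMINGHAM", "SOUTHAMPTON", "LONDON"],
})

# Columns that identify a person in both dataframes.
cols = ["surname", "initials", "town"]

# Outer merge keeps every key combination; _merge records its origin.
df_res = df1.merge(df2, how="outer", on=cols, indicator=True)

# Keep only the rows that do not appear in both dataframes.
df_res = df_res[df_res["_merge"] != "both"]
print(df_res)
```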

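You can also filter on left_only or right_only alone. Here df1 is the left dataframe, so the rows of df2 that are missing from the master list are the ones marked right_only.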
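
If the dataframes also have columns that cannot be matched, one possible variation is to merge from df2's side against only the key columns of the master, so that df2's extra columns come through unchanged. This is a sketch under assumptions: the function name missing_from_master and the phone column are hypothetical and only there for illustration. Because df2 is on the left in this version, the rows to keep are the ones marked left_only.

```python
import pandas as pd

def missing_from_master(master, other, cols):
    """Return rows of `other` whose values in `cols` do not occur in `master`.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Only the key columns of the master are needed; drop repeated keys
    # so the merge cannot duplicate rows of `other`.
    keys = master[cols].drop_duplicates()
    merged = other.merge(keys, how="left", on=cols, indicator=True)
    # `other` is the left frame here, so unmatched rows are 'left_only'.
    unmatched = merged[merged["_merge"] == "left_only"]
    return unmatched.drop(columns="_merge").reset_index(drop=True)

# Sample data from the question, plus a made-up 'phone' column in df2.
df1 = pd.DataFrame({
    "surname": ["JONES", "DAVIES"],
    "initials": ["D V", "H G"],
    "town": ["LONDON", "BIRMINGHAM"],
})
df2 = pd.DataFrame({
    "surname": ["DAVIES", "HARRIS", "JONES"],
    "initials": ["H G", "P J", "D V"],
    "town": ["BIRMINGHAM", "SOUTHAMPTON", "LONDON"],
    "phone": ["111", "222", "333"],  # illustrative values only
})

cols = ["surname", "initials", "town"]
to_add = missing_from_master(df1, df2, cols)
print(to_add)
```

The returned rows could then be appended to the master with pd.concat([df1, to_add], ignore_index=True); columns that exist only in df2 would be filled with NaN for the existing master rows.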
