Comparing two dataframes to return a new dataframe using pandas - Python

Need your help, please.

I have two dataframes created from CSV files, and I need to return a new dataframe containing the difference between the two on a specific field/column. For example, if an ID from df1 is not in df2, then df3 should give me all columns and rows from df1 that are not in df2.

Note that the df1 and df2 columns are not identical, i.e. df1 could have more or fewer columns than df2, but the columns in df3 should be those of df1. Also, the ID (df1) and User ID (df2) values are the reconciling factor: the data in those fields is common to both, but the actual field names differ.

Apologies in advance, as the tables below are not clear. In the example below, the 1st row in df1 is not in df2, so df3 should contain this row. Once done, I need to save df3 as a CSV.

DF1

Direction  ID  Quantity  Company  Status
Sell       09  32000     T LTD    Rejected
Buy        12  25000     G Ltd    Done
Sell       15  35000     H Ltd    Done

DF2

Direction  User ID  Quantity  Company  Status  Rating
Buy        12       25000     G Ltd    Done    Good Rating
Sell       15       35000     H Ltd    Done    Good Rating

Many thanks in advance.

Code so far:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

I have found the isin method, which I think will help, but the problem is that the "User ID" column name (df2) contains a space in the dataframe (as it does in the CSV).

createDf1[createDf1.ID.isin(createDf2.columns[2].values)]

and I get the error below:

AttributeError: 'str' object has no attribute 'values'

I passed columns[2] to the isin method because the User ID column name contains a space.
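A note on the AttributeError above: `columns` is an Index of column labels, so `columns[2]` is a single label (a plain string) rather than the column's data, and a string has no `.values`. The sketch below uses a small hypothetical frame, `df2`, standing in for corp.csv with a few columns from the DF2 table; the real file may be laid out differently. If the column order matches that table, "User ID" is at position 1, so `columns[2]` would be "Quantity".

```python
import pandas as pd

# Hypothetical stand-in for corp.csv, using rows from the DF2 table.
df2 = pd.DataFrame({
    "Direction": ["Buy", "Sell"],
    "User ID": ["12", "15"],
    "Quantity": [25000, 35000],
})

label = df2.columns[1]      # a column label: the plain str "User ID"
by_label = df2[label]       # the column itself, a pandas Series
by_name = df2["User ID"]    # same Series; bracket access handles the space

print(type(label).__name__)
print(by_label.tolist())
```

Either `df2[label]` or `df2["User ID"]` can be passed to `isin`, since both are Series of values rather than a label.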

Please help me address the error, and explain why the data is not being read, so that I can get a unique set where the User ID from df2 is not in ID in df1.

See below: the highlighted one is the one that is missing in DF2, and I would like it in df3.

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I would do:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

# df3 will have createDf1's columns, with only the rows whose ID is not in createDf2
# ~ negates the filter (logical 'not')
# Access the column via ['COLUMN NAME'] so the name can contain spaces ;)
df3 = createDf1[~createDf1['ID'].isin(createDf2['User ID'])]
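To check the filter against the sample rows from the question, here is a self-contained sketch. The frames are built by hand from the DF1 and DF2 tables instead of being read from BBG.csv and corp.csv, the names `df1` and `df2` and the output path `df3.csv` are assumptions, and IDs are stored as strings so that "09" keeps its leading zero.

```python
import pandas as pd

# Sample rows copied from the DF1 and DF2 tables in the question.
df1 = pd.DataFrame({
    "Direction": ["Sell", "Buy", "Sell"],
    "ID": ["09", "12", "15"],
    "Quantity": [32000, 25000, 35000],
    "Company": ["T LTD", "G Ltd", "H Ltd"],
    "Status": ["Rejected", "Done", "Done"],
})
df2 = pd.DataFrame({
    "Direction": ["Buy", "Sell"],
    "User ID": ["12", "15"],
    "Quantity": [25000, 35000],
    "Company": ["G Ltd", "H Ltd"],
    "Status": ["Done", "Done"],
    "Rating": ["Good Rating", "Good Rating"],
})

# Keep the df1 rows whose ID does not appear in df2's "User ID" column.
df3 = df1[~df1["ID"].isin(df2["User ID"])]
print(df3)

# Save the result without the index; "df3.csv" is a hypothetical path.
df3.to_csv("df3.csv", index=False)
```

One thing to watch with real files: `isin` compares values as they are stored, so if one key column is read as integers and the other as strings, no rows will match. Reading the key column with an explicit type, for example `pd.read_csv(path, dtype={"ID": str})`, is one way to keep the two sides comparable.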

I hope this helps!
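As an alternative to `isin`, the same "rows in df1 but not in df2" result can be sketched as a left merge with `indicator=True`, which is sometimes called an anti-join. This is not part of the answer above, only another way to express it; the frames below are hypothetical samples based on the question's tables, with a duplicate key added to df2 to show why the keys are de-duplicated first.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Direction": ["Sell", "Buy", "Sell"],
    "ID": ["09", "12", "15"],
    "Quantity": [32000, 25000, 35000],
})
df2 = pd.DataFrame({
    "User ID": ["12", "15", "15"],   # note the duplicate key
    "Rating": ["Good Rating", "Good Rating", "Good Rating"],
})

# Join df1 against the distinct df2 keys only, so that duplicate keys in
# df2 cannot multiply df1 rows. indicator=True adds a "_merge" column
# recording whether each df1 row found a match.
keys = df2[["User ID"]].drop_duplicates()
merged = df1.merge(keys, left_on="ID", right_on="User ID",
                   how="left", indicator=True)

# Rows marked "left_only" exist in df1 but not in df2; keep df1's columns.
df3 = merged.loc[merged["_merge"] == "left_only", df1.columns]
print(df3)
```

For a single key column the `isin` filter is shorter; the merge form tends to be more convenient when the match involves several columns at once.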
