Finding the difference between 2 different sized dataframes with Pandas

I have 2 CSV files that were created on different dates, and I want to compare them and show what remained the same and what has changed. I don't know where or how to begin, because when I try different merges and joins I run into the issue of the dataframes not being the same size.

df1:
I    ID     Status
0    123    Active
1    124    Active
2    125    Inactive
3    126    Active
4    128    Inactive
df2:
I    ID     Status
0    123    Active
1    124    Inactive
2    125    Inactive
3    126    Active
4    128    Active
5    129    Active
6    130    Active
7    131    Active
8    132    Inactive

The goal is to highlight the statuses that changed from df1 to df2 and the ones that remained constant. Using the example above, maybe I create 2 separate dataframes that look something like this:

df3: (containing all new changes)
I    ID     Status
1    124    Inactive
4    128    Active
5    129    Active
6    130    Active
7    131    Active
df4: (containing all the other 'Active' rows that remained constant)
I    ID     Status
0    123    Active
3    126    Active

To explain the logic behind each row and why it is included, I will go row by row, because I don't know if my example is clear enough:

df3:
Index 1 - active to inactive
Index 4 - inactive to active
Index 5 - new active row
Index 6 - new active row
Index 7 - new active row
Index 8 - new inactive row
df4:
Index 0 - remained constant
Index 2 - remained constant
Index 3 - remained constant 

I don't know how to approach this, because with merge and join I come across the error that the dataframes need to be the same size. Basically, what I want to do is find what changed and what stayed the same from df1 to df2. I have 2 sample datasets that I am working with; they have more statuses, but the idea is the same. Here is a Google Sheet with both CSV files: updated_values would be df2 and original_values would be df1.

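You need to perform a full outer join to get all the entries from both datasets. pd.merge does not require the two dataframes to be the same size: it matches rows on the key column. Any ID that exists in only one dataframe gets NaN in the Status column that comes from the other one, so the IDs that are only in df2 will have NaN for their df1 status.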

import pandas as pd

df3 = pd.merge(left=df1, right=df2, on='ID', how='outer', indicator=True)
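To see what the outer merge produces, here is a minimal sketch that rebuilds the sample data from the question by hand (the real data would come from the two CSV files). The name `merged` is only illustrative and plays the role of df3 above.

```python
import pandas as pd

# Sample data copied from the question; in practice these would be read from the CSV files.
df1 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128],
    "Status": ["Active", "Active", "Inactive", "Active", "Inactive"],
})
df2 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128, 129, 130, 131, 132],
    "Status": ["Active", "Inactive", "Inactive", "Active", "Active",
               "Active", "Active", "Active", "Inactive"],
})

# Full outer join on ID: keeps every ID from either dataframe.
# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
merged = pd.merge(left=df1, right=df2, on="ID", how="outer", indicator=True)

print(merged)
```

The four IDs that exist only in df2 (129 to 132) come out with NaN in the df1 status column and 'right_only' in '_merge'.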

This new dataframe will contain a column 'Status_x' with the values from df1 and a column 'Status_y' with the values from df2, plus a '_merge' column (added by indicator=True) that says whether each ID was found in df1 only, df2 only, or both. Then you can simply create a new column called 'change' to store the changes. You could use boolean indexing to check which rows have changed:

new_rows = df3['_merge'] == 'right_only'  # True if the ID was not in df1
constant = df3['Status_x'] == df3['Status_y']  # True if the Status is the same in both dataframes

# String concatenation to show the status change, e.g. 'Active to Inactive'
df3['change'] = df3['Status_x'] + ' to ' + df3['Status_y']
# Label new rows with their status, e.g. 'New active row' or 'New inactive row'
df3.loc[new_rows, 'change'] = 'New ' + df3.loc[new_rows, 'Status_y'].str.lower() + ' row'
# Label rows whose status did not change
df3.loc[constant, 'change'] = 'Remained constant'
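If you want two separate dataframes, as in the question, the same masks can be used to split the merged result. The following is a sketch under the assumption that "changed" means "status differs or the ID is new"; the names `merged`, `changed`, `unchanged` and `active_unchanged` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128],
    "Status": ["Active", "Active", "Inactive", "Active", "Inactive"],
})
df2 = pd.DataFrame({
    "ID": [123, 124, 125, 126, 128, 129, 130, 131, 132],
    "Status": ["Active", "Inactive", "Inactive", "Active", "Active",
               "Active", "Active", "Active", "Inactive"],
})

merged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

# NaN never compares equal, so new IDs (NaN in Status_x) count as changed.
unchanged_mask = merged["Status_x"] == merged["Status_y"]

# Keep the df2 status as the current one. An ID present only in df1 would
# show up in `changed` with a NaN status (there are none in this sample).
cols = ["ID", "Status_y"]
changed = merged.loc[~unchanged_mask, cols].rename(columns={"Status_y": "Status"})
unchanged = merged.loc[unchanged_mask, cols].rename(columns={"Status_y": "Status"})

# Optional: only the unchanged rows that are 'Active', like df4 in the question.
active_unchanged = unchanged[unchanged["Status"] == "Active"]

print(changed)
print(unchanged)
```

Because the split is driven by the ID column rather than by row position, it does not matter that df2 has more rows than df1.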
