Compare pandas DataFrames and return rows that are missing from the first one
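I have 2 DataFrames and want to compare them and return the rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but I can't figure out how to return only the rows from df1 that are missing from df2.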
import pandas as pd
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
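# Stack both frames, then group identical rows together
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
# Keep the index of every row that occurs exactly once (present in only one frame)
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)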
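For reference, the groupby approach above keeps every row that occurs exactly once in the concatenated frame, which is why it returns unmatched rows from both df1 and df2 rather than from df1 alone. Assuming neither frame contains duplicate rows of its own, roughly the same result can be sketched with `drop_duplicates(keep=False)`. The names `combined` and `unmatched` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Stack both frames on a fresh 0..n-1 index
combined = pd.concat([df1, df2], ignore_index=True)

# keep=False drops every copy of a duplicated row, so only rows that
# appear in exactly one of the two frames survive (a symmetric difference)
unmatched = combined.drop_duplicates(keep=False)
print(unmatched)
```

The Chicago row disappears because it is in both frames, but the rows unique to df2 are still present, so this alone does not answer the question.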
Based on @EdChum's suggestion:
df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]
rows_in_df1_not_in_df2
|Index |City |State |
|------|------------|------------|
|1 |San Franciso|California |
|2 |Boston |Massachusett|
Edit: incorporated @RobertPeters' suggestion
IIUC, if you are using pandas version 0.17.0 or later, you can use merge and set indicator=True:
In [80]:
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)
Out[80]:
City State _merge
0 Chicago Illinois both
1 San Franciso California left_only
2 Boston Massachusett left_only
3 Mmmmiami Florida right_only
4 Dallas Texas right_only
5 Omaha Nebraska right_only
This adds a column indicating whether each row is present only in the lhs, only in the rhs, or in both.
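If only the df1 side is of interest, the indicator column can be filtered directly. The following is a minimal sketch, not the only way to do it. The helper name `rows_missing_from` is hypothetical, and it uses a left merge instead of an outer merge so that rows found only in df2 are never produced in the first place:

```python
import pandas as pd

def rows_missing_from(left, right):
    """Return the rows of `left` that have no exact match in `right`.

    Hypothetical helper; assumes both frames share the same column names.
    """
    # Dropping duplicates on the right avoids multiplying matched rows
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    only_left = merged[merged["_merge"] == "left_only"]
    # Remove the helper column and renumber the result
    return only_left.drop(columns="_merge").reset_index(drop=True)

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

missing = rows_missing_from(df1, df2)
print(missing)
```

Because no `on` argument is given, merge joins on all columns the two frames have in common, so a row counts as a match only if both City and State agree.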
If you are on pandas < 0.17.0, you can do something like:
In [182]: df = pd.merge(df1, df2, on='City', how='outer')
In [183]: df
Out[183]:
City State_x State_y
0 Chicago Illinois Illinois
1 San Franciso California NaN
2 Boston Massachusett NaN
3 Mmmmiami NaN Florida
4 Dallas NaN Texas
5 Omaha NaN Nebraska
In [184]: df.loc[df['State_y'].isnull(),:]
Out[184]:
City State_x State_y
1 San Franciso California NaN
2 Boston Massachusett NaN
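Note that the merge above joins on City only, so a row with the same City but a different State would still count as a match. One possible way to compare on all columns without the indicator argument is to tag df2 with a marker column before a left merge. This is a sketch under that assumption, and the column name `_in_df2` is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

cols = list(df1.columns)

# Tag every distinct df2 row; `_in_df2` is a hypothetical marker column
marked = df2.drop_duplicates().copy()
marked["_in_df2"] = True

# Left merge on all shared columns: unmatched df1 rows get NaN in the marker
merged = df1.merge(marked, on=cols, how="left")
only_in_df1 = merged.loc[merged["_in_df2"].isnull(), cols]
print(only_in_df1)
```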
You can also use a list comprehension to compare the rows and return the missing elements:
dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]
Returns:
['San Franciso', 'Boston']
Then you can get a DataFrame containing only the rows that differ:
dfdif = df1[(df1['City'].isin(dif_list))]
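The list comprehension can probably be skipped, since `isin` accepts a Series directly. The sketch below shows that shortcut, plus a variant that compares whole rows instead of only City by turning each frame's rows into a MultiIndex. It assumes a pandas version that provides `MultiIndex.from_frame`, and the names `left_keys`, `right_keys` and `dfdif_all` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Single-column comparison: keep df1 rows whose City never appears in df2
dfdif = df1[~df1["City"].isin(df2["City"])]

# Whole-row comparison: treat each row as a tuple key
cols = list(df1.columns)
left_keys = pd.MultiIndex.from_frame(df1[cols])
right_keys = pd.MultiIndex.from_frame(df2[cols])
dfdif_all = df1[~left_keys.isin(right_keys)]

print(dfdif)
print(dfdif_all)
```

Both results keep the original df1 index, which can be handy for tracing the rows back to the source frame.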