
Compare pandas DataFrames and return rows from the first one that are missing from the second

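I have 2 DataFrames and want to compare them and return the rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but can't figure out how to return only the rows from df1.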

import pandas as pd
from pandas import Series, DataFrame

df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )

df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )



df = pd.concat([df1, df2])
df = df.reset_index(drop=True)

df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)

Based on @EdChum's suggestion:

df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]

rows_in_df1_not_in_df2

|Index |City        |State       |
|------|------------|------------|
|1     |San Franciso|California  |
|2     |Boston      |Massachusett|

Edit: incorporated @RobertPeters's suggestion

IIUC, if you are using pandas version 0.17.0 or later, you can use merge and set indicator=True:

In [80]:
df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )

df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)

Out[80]:
           City         State      _merge
0       Chicago      Illinois        both
1  San Franciso    California   left_only
2        Boston  Massachusett   left_only
3      Mmmmiami       Florida  right_only
4        Dallas         Texas  right_only
5         Omaha      Nebraska  right_only

This adds a _merge column indicating whether each row is present only in the lhs, only in the rhs, or in both.
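Since only the df1 rows are wanted, the indicator column can be filtered right after the merge. The following is a minimal sketch, not a definitive implementation: the helper name rows_only_in_left is hypothetical, it assumes both frames share the same column names, and it uses how='left' instead of 'outer' because the right_only rows are never needed here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})


def rows_only_in_left(left, right):
    """Hypothetical helper: rows of `left` with no exact match in `right`.

    Assumes `left` and `right` have the same column names, so that
    merge joins on all of them.
    """
    # A left merge keeps every row of `left`; the indicator marks the
    # rows that found no partner in `right` as 'left_only'.
    merged = pd.merge(left, right.drop_duplicates(), how="left", indicator=True)
    # Keep the unmatched rows and drop the helper column again.
    return merged.loc[merged["_merge"] == "left_only", left.columns]


result = rows_only_in_left(df1, df2)
print(result)
```

Note that merge builds a fresh index, so the labels of the result refer to positions in the merged frame rather than to the original df1 index.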

If you are on pandas < 0.17.0, you can do something like:

In [182]: df = pd.merge(df1, df2, on='City', how='outer')

In [183]: df
Out[183]:
           City       State_x   State_y
0       Chicago      Illinois  Illinois
1  San Franciso    California       NaN
2        Boston  Massachusett       NaN
3      Mmmmiami           NaN   Florida
4        Dallas           NaN     Texas
5         Omaha           NaN  Nebraska

In [184]: df.loc[df['State_y'].isnull(), :]
Out[184]:
           City       State_x State_y
1  San Franciso    California     NaN
2        Boston  Massachusett     NaN
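The null check on State_y relies on merging on City only and on State never being genuinely missing in df2. A possible variant, sketched here under the assumption that the two frames share all column names, adds an explicit marker column to df2 before merging, so the check no longer depends on the data columns. The marker name _in_df2 is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Tag every (deduplicated) row of df2 with a marker column.
# "_in_df2" is a hypothetical name; any name not used by df1 works.
marked = df2.drop_duplicates().assign(_in_df2=True)

# The left merge joins on the shared columns (City and State);
# rows of df1 without a match get NaN in the marker column.
merged = pd.merge(df1, marked, how="left")

only_in_df1 = merged[merged["_in_df2"].isnull()].drop(columns="_in_df2")
print(only_in_df1)
```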

You can also use a list comprehension and compare the City values to return the missing elements:

dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]

Which returns:

['San Franciso', 'Boston']

Then you can get a DataFrame containing only the differing rows:

dfdif = df1[(df1['City'].isin(dif_list))]
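The same filter can be written without the intermediate list by negating isin directly. The sketch below shows that, plus a whole-row variant that compares every column through a MultiIndex. The whole-row part is an assumption about what is wanted, since the list comprehension above only looks at City.

```python
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Chicago", "San Franciso", "Boston"],
    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({
    "City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

# Single-column comparison: keep df1 rows whose City is absent from df2.
by_city = df1[~df1["City"].isin(df2["City"])]

# Whole-row comparison: turn each row into a tuple via a MultiIndex,
# then test membership of df1's row tuples in df2's row tuples.
in_df2 = pd.MultiIndex.from_frame(df1).isin(pd.MultiIndex.from_frame(df2))
by_row = df1[~in_df2]

print(by_city)
print(by_row)
```

Both variants keep the original df1 index, unlike the merge-based approaches. On this data they select the same rows; they would differ if a city appeared in both frames with different states.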

