How to extract rows in a pandas DataFrame that are NOT in a subset DataFrame
I have two dataframes, DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct, i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF, but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I ran DF.dropna() beforehand and the result still produced NaN.
You need merge with an outer join and boolean indexing, because DataFrame.isin needs both the values and the index to match:
import pandas as pd

DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print(DF)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print(SubDF)
   A  B  C  D  E  F
0  3  6  9  5  6  3
# no match: the row sits at index 2 in DF but index 0 in SubDF, so nothing is filtered out
DF2 = DF[~DF.isin(SubDF)]
print(DF2)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print(DF2)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
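The NaN rows described in the question can be reproduced when SubDF keeps DF's index labels: indexing a DataFrame with a boolean DataFrame masks individual cells (much like DataFrame.where) rather than dropping rows. Below is a minimal sketch of that effect and of the merge approach wrapped in a function. It assumes SubDF was sliced from DF, and the helper name anti_join is hypothetical, not part of pandas.

```python
import pandas as pd

DF = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
SubDF = DF.iloc[[2]]  # assumption: the subset keeps DF's index label 2

# A boolean DataFrame mask blanks out cells; it does not drop rows.
# Row 2 is still present in `masked`, but all of its values are NaN.
masked = DF[~DF.isin(SubDF)]

def anti_join(left, right):
    """Return rows of `left` with no identical row in `right` (hypothetical helper)."""
    # drop_duplicates() keeps repeated rows in `right` from multiplying rows of `left`
    merged = left.merge(right.drop_duplicates(), how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

result = anti_join(DF, SubDF)
```

The left merge compares rows by value on all shared columns, so it does not depend on the index labels of either frame.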
Another way, borrowing the setup from @jezrael. This relies on sub keeping the index labels its rows have in df:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]}, index=[2])  # the matching row has index label 2 in df
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not come back in the original df order. If the original order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
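If the goal is only to keep df's row order, a boolean mask on the index may be a shorter route than sorting afterwards. This is a sketch under the same assumption as above (sub keeps the index labels of its rows in df, and those labels are unique); the sample values and index labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[10, 5, 7])
sub = df.loc[[5]]  # assumption: sub keeps df's index labels

# Keep rows whose index label is absent from sub.
# The mask is evaluated row by row, so df's original order is preserved.
df_extract = df[~df.index.isin(sub.index)]
```

Because the mask is one-dimensional (one boolean per row), it drops rows instead of filling cells with NaN.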