How to extract the rows of a pandas DataFrame that are NOT in a subset DataFrame

Question

I have two dataframes, DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.

I tried the following:

DF2 = DF[~DF.isin(SubDF)]

The number of rows is correct and most rows are correct,

i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF,

but I get rows with NaN values that do not exist in the original DF.

Not sure what I'm doing wrong.

Note: the original DF does not have any NaN values. To double-check, I ran DF.dropna() beforehand and the result still contained NaN.

Answer 1

You need merge with an outer join plus boolean indexing, because DataFrame.isin only reports a match when both the values and the index labels match:

import pandas as pd

DF = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (DF)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

SubDF = pd.DataFrame({'A':[3],
                   'B':[6],
                   'C':[9],
                   'D':[5],
                   'E':[6],
                   'F':[3]})

print (SubDF)
   A  B  C  D  E  F
0  3  6  9  5  6  3

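# isin aligns on index and column labels; SubDF's only row has index 0,
# so nothing matches and DF comes back unchanged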
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4

Answer 2

Another way, borrowing the setup from @jezrael. It compares index labels only, so it assumes sub keeps the index labels its rows have in df:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

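# Give sub the index label its row has in df (2); with the default
# index 0, the set difference would drop the wrong row.
sub = pd.DataFrame({'A':[3],
                    'B':[6],
                    'C':[9],
                    'D':[5],
                    'E':[6],
                    'F':[3]}, index=[2])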

extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]

The rows may not come back in the original df order. If the original order is required:

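# Index labels present in df but not in sub
extract_idx = list(set(df.index) - set(sub.index))
# Map each index label to its position in df
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
# Sort the surviving labels by their original position
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]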
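
A shorter variant of the same index-based idea, offered as a sketch rather than as part of the original answer: Index.isin builds a boolean mask over df's own index, so the surviving rows stay in df's order without any sorting step. As above, it assumes sub keeps the index labels of its rows in df; the string labels are made up to make the ordering visible.

```python
import pandas as pd

# Illustrative data; the string index labels are an assumption.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['c', 'a', 'b'])
sub = df.loc[['a']]  # subset that keeps its original index label

# True for every label of df that does not appear in sub's index.
keep = ~df.index.isin(sub.index)
df_extract = df[keep]
print(df_extract)
```

Because the mask has one entry per row of df, this also behaves sensibly when df has duplicate index labels, which a set difference would collapse.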

Answer 1
2, accepted, 2017-02-21 09:47:29

Answer 2
1, 2017-02-21 11:24:05
