pandas - filter dataframe by another dataframe by row elements
I have a dataframe df1 which looks like:
c k l
0 A 1 a
1 A 2 b
2 B 2 a
3 C 2 a
4 C 2 d
and another called df2 like:
c l
0 A b
1 C a
I would like to filter df1, keeping only the rows whose values are NOT in df2. The values to filter out are expected to be the (A, b) and (C, a) tuples. So far I have tried to apply the isin method:

d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
That seems too complicated to me, and it returns:
c k l
2 B 2 a
4 C 2 d
but I'm expecting:
c k l
0 A 1 a
2 B 2 a
4 C 2 d
You can do this efficiently using isin on a multiindex constructed from the desired columns:
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
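As a quick sanity check of that claim, here is the same recipe applied to hypothetical frames whose key columns are integers rather than strings:

```python
import pandas as pd

# Hypothetical data: the key columns 'c' and 'l' are integers this time
df1 = pd.DataFrame({'c': [1, 1, 2, 3, 3],
                    'k': ['x', 'y', 'x', 'x', 'z'],
                    'l': [10, 20, 10, 10, 40]})
df2 = pd.DataFrame({'c': [1, 3],
                    'l': [20, 10]})

# Same multiindex-based recipe as above
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
out = df1[~i1.isin(i2)]
# Rows with pairs (1, 20) and (3, 10) are dropped; rows 0, 2 and 4 remain
```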
(The above answer is an edit. The following was my initial answer.)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two frames, then dropping the rows where df2 is defined. Here is an example, which makes use of a temporary marker column:
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without the temporary column, but I can't think of one. As long as your data isn't huge, the above method should be a fast and sufficient answer.
This is very concise and works well:
df1 = df1[~df1.index.isin(df2.index)]
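Note that this one-liner only matches on the (c, l) pairs if both frames are already indexed by the key columns; with the default integer indices it would drop rows by position instead. A minimal sketch of the setup it appears to assume (the set_index calls are my addition):

```python
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# Index both frames by the key columns, so the index itself carries the (c, l) pairs
a = df1.set_index(['c', 'l'])
b = df2.set_index(['c', 'l'])

# Now the one-liner applies; reset_index restores the original layout
result = a[~a.index.isin(b.index)].reset_index()
# result keeps the (A, a), (B, a) and (C, d) rows
```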
DataFrame.merge & DataFrame.query: A more elegant method is to do a left join with the argument indicator=True, then filter all the rows which are left_only with query:
d = (
    df1.merge(df2,
              on=['c', 'l'],
              how='left',
              indicator=True)
       .query('_merge == "left_only"')
       .drop(columns='_merge')
)
print(d)
c k l
0 A 1 a
2 B 2 a
4 C 2 d
indicator=True returns a dataframe with an extra column _merge, which marks each row as left_only, both, or right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
c k l _merge
0 A 1 a left_only
1 A 2 b both
2 B 2 a left_only
3 C 2 a both
4 C 2 d left_only
I think this is a quite simple approach when you want to filter a dataframe based on multiple columns from another dataframe, or even based on a custom list.
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values))  # [('A', 'b'), ('C', 'a')]

# keep only the rows of df1 whose (c, l) pair is NOT present in idxs
df1 = df1[~pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
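One caveat with plain string concatenation as a key: distinct pairs can collide, e.g. 'A' + 'Ba' and 'AB' + 'a' both give 'ABa'. A hypothetical sketch using a separator character that is assumed never to occur in the data:

```python
import pandas as pd

# Hypothetical data crafted so that two different (c, l) pairs concatenate identically
df1 = pd.DataFrame({'c': ['A', 'AB'], 'k': [1, 2], 'l': ['Ba', 'a']})
df2 = pd.DataFrame({'c': ['AB'], 'l': ['a']})

# Naive concatenation collides: 'A'+'Ba' == 'AB'+'a' == 'ABa', so BOTH rows are dropped
naive = df1[~(df1['c'] + df1['l']).isin(df2['c'] + df2['l'])]

# A separator that cannot appear in the data keeps the keys distinct
safe = df1[~(df1['c'] + '|' + df1['l']).isin(df2['c'] + '|' + df2['l'])]
# safe correctly keeps the ('A', 'Ba') row
```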
Another option that avoids creating an extra column or doing a merge is to do a groupby on df2 to get the distinct (c, l) pairs and then filter df1 using that.
gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples, since it drops into pure Python.
You can concatenate both DataFrames and drop all duplicates:
pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)
Output:
c k l
0 A 1.0 a
2 B 2.0 a
4 C 2.0 d
This method doesn't work if you have duplicate rows on subset=['c', 'l'] within df1 itself.
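To make that caveat concrete, here is a small hypothetical example where df1 repeats a ('c', 'l') pair that never appears in df2; keep=False discards those rows as well:

```python
import pandas as pd

# ('B', 'a') appears twice in df1 and never in df2
df1 = pd.DataFrame({'c': ['A', 'B', 'B'],
                    'k': [1, 2, 3],
                    'l': ['b', 'a', 'a']})
df2 = pd.DataFrame({'c': ['A'], 'l': ['b']})

out = pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)
# keep=False removes every member of a duplicated group, so both ('B', 'a')
# rows vanish too and the result is empty, not the two rows one might expect
```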