Efficient way to apply .isin to each row in pandas
How can I use .isin in pandas so that it uses values from each row of the DataFrame rather than a static set of values?
For example, let's say we have a DataFrame like this:
import pandas as pd
import datetime
l = []
for i in range(100000):
    d = {'a':i,'b':{1,2,3},'c':0}
    l.append(d)
df = pd.DataFrame(l)
If I use .isin, it takes only one collection of values (in this example {1,2,3}) and checks each value in the column you want to compare (i.e. df['a']) against it:
test = df['a'].isin({1,2,3})
If I instead want to check, row by row, whether the value in 'a' is in the corresponding set in df['b'], I can do the following:
def check(a, b):
    return a in b
test = list(map(check, df['a'], df['b']))
Of course, in this example all values in df['b'] are the same, but we can pretend they are not.
Unfortunately, this is about 5x slower than just using .isin. My question is: is there a way to use .isin, but against each of the values in df['b']? It does not necessarily have to be .isin; what would be a more efficient way to do it?
You can use DataFrame.apply with in here:
df.apply(lambda x: x['a'] in x['b'], axis=1)
0 False
1 True
2 True
3 True
4 False
...
99995 False
99996 False
99997 False
99998 False
99999 False
Length: 100000, dtype: bool
Or use a list comprehension with zip, which is faster:
[a in b for a, b in zip(df['a'], df['b'])]
[False,
True,
True,
True,
False,
False,
False,
False,
False,
False,
False,
False,
False,
...]
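Note that the list comprehension returns a plain Python list rather than a Series. If the result is needed as a boolean mask, it can be wrapped in a Series that shares the DataFrame's index. The following is a minimal sketch on a small frame; the names df_small, mask, and matches are illustrative and not part of the original question.

```python
import pandas as pd

# Small illustrative frame; unlike the question's data, the per-row sets differ.
df_small = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [{1, 2}, {5, 6}, {3}, set()],
})

# Row-wise membership test, wrapped in a Series aligned to df_small's index.
mask = pd.Series(
    [a in b for a, b in zip(df_small['a'], df_small['b'])],
    index=df_small.index,
)

# The mask can now be used for boolean indexing.
matches = df_small[mask]
print(matches)
```

Passing index=df_small.index keeps the mask aligned even if the frame has a non-default index.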
%%timeit
def check(a, b):
    return a in b
list(map(check, df['a'], df['b']))
28.6 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
[a in b for a, b in zip(df['a'], df['b'])]
22.5 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.apply(lambda x: x['a'] in x['b'], axis=1)
2.27 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
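The %%timeit cells above only run in IPython or Jupyter. To reproduce the comparison in a plain script, something along these lines should work with the standard timeit module. This is a sketch: the frame is reduced to 1000 rows (an assumption, so the apply variant finishes quickly), the helper names with_map, with_zip, and with_apply are made up for illustration, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller frame than the question's 100000 rows so the apply variant stays quick.
n_rows = 1000
df_bench = pd.DataFrame([{'a': i, 'b': {1, 2, 3}, 'c': 0} for i in range(n_rows)])


def with_map(df):
    # The question's original approach: map over the two columns.
    return list(map(lambda a, b: a in b, df['a'], df['b']))


def with_zip(df):
    # List comprehension over zipped columns.
    return [a in b for a, b in zip(df['a'], df['b'])]


def with_apply(df):
    # Row-wise DataFrame.apply, converted to a list for comparison.
    return df.apply(lambda x: x['a'] in x['b'], axis=1).tolist()


# Check that all three approaches agree before comparing their speed.
assert with_map(df_bench) == with_zip(df_bench) == with_apply(df_bench)

for fn in (with_map, with_zip, with_apply):
    seconds = timeit.timeit(lambda: fn(df_bench), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```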