Efficient way to apply .isin to each row in pandas

How can I use .isin in pandas so that it uses values from each row of the DataFrame rather than a single static set of values?

For example, let's say we have a DataFrame like this:

import pandas as pd

# Build 100,000 rows; each row carries its own set in column 'b'
l = []

for i in range(100000):
    d = {'a': i, 'b': {1, 2, 3}, 'c': 0}
    l.append(d)

df = pd.DataFrame(l)

If I use .isin, it takes only one collection of values (in this example {1,2,3}), which is compared against every value in the column being tested (i.e. df['a']):

test = df['a'].isin({1,2,3})
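
To make the static behaviour concrete, here is a minimal sketch on a tiny frame. The name `small` and its values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical four-row frame, not the 100,000-row example above
small = pd.DataFrame({'a': [0, 1, 2, 5]})

# The same collection {1, 2, 3} is tested against every value in 'a'
mask = small['a'].isin({1, 2, 3})
print(mask.tolist())
```

The result is one boolean per row, but every row is checked against the same set.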

If instead I want to check, row by row, whether the value in 'a' is contained in that row's set in df['b'], I can do the following:

def check(a, b):
    return a in b

test = list(map(check, df['a'], df['b']))

Of course, in this example all values in df['b'] are the same, but pretend they are not.

Unfortunately, this is about 5x slower than just using .isin. My question is: is there a way to use .isin against each of the values in df['b']? It does not necessarily have to be .isin; what would be a more efficient way to do this?

You can use DataFrame.apply with the in operator here:

df.apply(lambda x: x['a'] in x['b'], axis=1)
0        False
1         True
2         True
3         True
4        False
         ...  
99995    False
99996    False
99997    False
99998    False
99999    False
Length: 100000, dtype: bool

Or use a list comprehension with zip, which is faster:

[a in b for a, b in zip(df['a'], df['b'])]
[False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 ...]
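
The list comprehension returns a plain Python list. If a boolean Series is needed, for example to filter rows or to store the result as a column, one possible way is to wrap the list and reuse the frame's index. This is only a sketch; `demo`, `in_b`, and the column name `a_in_b` are hypothetical names, and the sets differ per row here to show the row-wise behaviour:

```python
import pandas as pd

# Hypothetical frame in which each row has a different set in 'b'
demo = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [{1, 2}, {5}, {3, 4}, set()],
})

# Row-wise membership test, wrapped in a Series aligned to demo's index
in_b = pd.Series(
    [a in b for a, b in zip(demo['a'], demo['b'])],
    index=demo.index,
)

demo['a_in_b'] = in_b   # store the result as a new column
matches = demo[in_b]    # keep only rows where 'a' is in that row's 'b'
print(matches)
```

Passing `index=demo.index` keeps the mask aligned even when the frame does not have a default 0..n-1 index.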

Timings:

%%timeit
def check(a, b):
    return a in b

list(map(check, df['a'], df['b']))

28.6 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
[a in b for a, b in zip(df['a'], df['b'])]

22.5 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.apply(lambda x: x['a'] in x['b'], axis=1)

2.27 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
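
The %%timeit cells above require IPython or Jupyter. As a rough sketch for a plain script, the standard-library timeit module can run the same comparison. The wrapper names (`with_map`, `with_zip`, `with_apply`) and the reduced row count are assumptions made here to keep the run short; absolute numbers will vary by machine and pandas version:

```python
import timeit

import pandas as pd

n = 10_000  # assumption: smaller than the original 100,000 rows
df = pd.DataFrame({'a': range(n), 'b': [{1, 2, 3}] * n})


def check(a, b):
    return a in b


def with_map():
    return list(map(check, df['a'], df['b']))


def with_zip():
    return [a in b for a, b in zip(df['a'], df['b'])]


def with_apply():
    return df.apply(lambda x: x['a'] in x['b'], axis=1).tolist()


# All three approaches should produce identical results
assert with_map() == with_zip() == with_apply()

for func in (with_map, with_zip, with_apply):
    # Average over a few calls; number=3 keeps the slow apply run short
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```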
