Faster way to do conditional slicing on a Pandas dataframe containing a column of arrays

Let's take this dataframe that has a column of arrays:

In:  import numpy as np
     import pandas as pd

     df = pd.DataFrame([['one', np.array([1,2,3,4])],
                        ['two', np.array([1,3])],
                        ['three', np.array([0,2,4])]],
                       columns=['id', 'items'])

Out:
      id         items
0    one  [1, 2, 3, 4]
1    two        [1, 3]
2  three     [0, 2, 4]

If I want to filter by an element being in 'items', I would do:

In: df[ df['items'].apply(lambda x: 2 in x)] 

Out:
      id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]

However, this method is extremely slow and my dataframe is very large. Is there a faster way to check the elements in 'items'?

Using sets, you can check whether a given number (2 here) is in each array by testing whether {2} is a subset of it with set.issubset:

df[df['items'].agg({2}.issubset)]

     id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]
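
A minimal sketch of the same subset test written with `Series.map`, which always calls the function once per element. This is a variant, not the code from the answer above: as far as I know, recent pandas releases warn when `Series.agg` is given an element-wise callable, so `map` may be the safer spelling there. The name `wanted` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# `wanted` is a hypothetical name; any set of required values works.
wanted = {2}

# set.issubset accepts any iterable, so it can be called on each array directly.
mask = df['items'].map(wanted.issubset)
result = df[mask]

# The same pattern checks several values at once:
# keep only rows whose array contains both 1 and 2.
both = df[df['items'].map({1, 2}.issubset)]
```

With the three-row frame above, `result` keeps the rows with id 'one' and 'three', and `both` keeps only 'one'.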

Timings on a large dataframe:

df_large = pd.concat([df]*100_000, axis=0, ignore_index=True)

%timeit df_large[df_large['items'].agg({2}.issubset)]
# 355 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.DataFrame(df_large['items'].tolist()).isin([2]).any(axis=1)
# 564 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# any(level=0) is only accepted by pandas versions before 2.0
%timeit df_large[df_large['items'].explode().eq(2).any(level=0)]
# 658 ms ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
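
The `%timeit` lines above only run inside IPython. A rough sketch of how a similar comparison could be run as a plain script with the standard `timeit` module follows. The dictionary `candidates`, its labels, and the smaller repeat factor are my own assumptions, and the numbers it prints will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# Assumption: 1_000 repeats instead of 100_000 so the script finishes quickly.
df_large = pd.concat([df] * 1_000, axis=0, ignore_index=True)

# Hypothetical labels mapped to zero-argument callables to time.
candidates = {
    'map + issubset': lambda: df_large[df_large['items'].map({2}.issubset)],
    'list comprehension': lambda: df_large[[2 in x for x in df_large['items']]],
}

# Best of 3 single runs per candidate, in seconds.
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.1f} ms')
```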

You can try explode (new in pandas 0.25.0) with any(level=0), which works on pandas versions before 2.0:

df[df['items'].explode().eq(2).any(level=0)]

      id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]
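
On pandas 2.0 and later, reductions such as `any` no longer accept the `level` argument, so the line above raises an error there. A sketch of what I believe is the equivalent spelling, using `groupby(level=0)` to collapse the exploded rows back to one value per original row (the intermediate name `exploded` is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# One row per array element; the original row label is repeated in the index.
exploded = df['items'].explode()

# Compare every element with 2, then reduce to one boolean per original row label.
mask = exploded.eq(2).groupby(level=0).any()

result = df[mask]
```

The mask is indexed by the original row labels, so `df[mask]` lines up with `df` without any extra reindexing.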

IIUC:

m = pd.DataFrame(df['items'].tolist()).isin([2]).any(axis=1)
m
Out[70]: 
0     True
1    False
2     True
dtype: bool
df1 = df[m].copy()
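
To make the step above easier to follow, here is a small sketch that keeps the intermediate frame in a variable (`wide` is a name I made up). Building a DataFrame from arrays of different lengths pads the shorter rows with NaN, and `isin([2])` then tests every cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# One column per array position; rows shorter than the longest array get NaN.
wide = pd.DataFrame(df['items'].tolist())

# True for rows where any cell equals 2; NaN cells never match.
m = wide.isin([2]).any(axis=1)

df1 = df[m].copy()
```

One design consequence worth keeping in mind: `wide` has as many columns as the longest array, so a single very long array makes the whole intermediate frame large.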

And we can try a list comprehension:

[2 in x for x in df['items']]
Out[81]: [True, False, True]
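
The list of booleans can be used directly as a row mask. A minimal sketch, with `mask` and `result` as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# Plain Python loop over the arrays; `in` returns one bool per row.
mask = [2 in x for x in df['items']]

# A boolean list with one entry per row selects rows by position.
result = df[mask]
```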
