Check whether a value is in a tuple in a dataframe

Question

I have a huge dataframe (38 million rows):

df = pd.DataFrame({'I':[1,2,3,4], 'C':[80,160,240,80],
                   'F':[(1,2,3,4),(5,7,2),(9,6,2,5,7),(4,0,8,3,2)]})

     C                F  I
0   80     (1, 2, 3, 4)  1
1  160        (5, 7, 2)  2
2  240  (9, 6, 2, 5, 7)  3
3   80  (4, 0, 8, 3, 2)  4

Now I would like to keep only the rows whose tuple in 'F' contains the number 3.

This should give:

     C                F  I
0   80     (1, 2, 3, 4)  1
3   80  (4, 0, 8, 3, 2)  4

Is there a high-performance, low-memory way to do this?

I tried np.equal((3), df['F'].values).all(), but this obviously does not work.

Answer 1

Use in with a list comprehension if performance is important:

df = df[[3 in x for x in df['F']]]

Or:

df = df[df['F'].apply(set) >= set([3])]

print(df)
   I   C                F
0  1  80     (1, 2, 3, 4)
3  4  80  (4, 0, 8, 3, 2)
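The set comparison above relies on `>=` meaning "is a superset of", which suggests a way to require several values at once. A minimal sketch, assuming a hypothetical set of required values named `wanted` (not part of the original question), written with a list comprehension since that was the faster variant in the timings:

```python
import pandas as pd

df = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                   'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# Hypothetical requirement: the tuple must contain both 2 and 5.
wanted = {2, 5}

# issubset accepts any iterable, so the tuples need no conversion to sets.
mask = [wanted.issubset(x) for x in df['F']]
result = df[mask]
print(result)
```

For a single value, `3 in x` remains the simpler test.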

Performance (this depends on the number of matched values, and on the length of df, too):

#[40000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)


In [166]: %timeit df[[3 in x for x in df['F']]]
5.57 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [167]: %timeit df[df['F'].apply(lambda x: 3 in x)]
12.2 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [170]:  %timeit df[df['F'].apply(set) >= set([3])]
29 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [171]:  %timeit df[pd.DataFrame(df['F'].values.tolist()).eq(3).any(1)]
37.4 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
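Outside IPython, a comparison like this can be approximated with the standard timeit module. This is only a sketch: the helper names `by_comprehension` and `by_apply` are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                      'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})
# Same 40000-row frame as in the timings above.
big = pd.concat([small] * 10000, ignore_index=True)


def by_comprehension(frame):
    # Boolean list built with a plain Python loop.
    return frame[[3 in x for x in frame['F']]]


def by_apply(frame):
    # Same test, routed through Series.apply.
    return frame[frame['F'].apply(lambda x: 3 in x)]


# Both variants must select the same rows before their speed is compared.
assert by_comprehension(big).equals(by_apply(big))

runs = 20
for func in (by_comprehension, by_apply):
    seconds = timeit.timeit(lambda: func(big), number=runs) / runs
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```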

A better structure, as @jpp pointed out, is a flat one with one row per tuple element:

from itertools import chain

lens = df['F'].str.len()
df = pd.DataFrame({
    'I' : df['I'].values.repeat(lens),
    'C' : df['C'].values.repeat(lens),
    'F' : list(chain.from_iterable(df['F'].tolist()))
})
print(df)
    I    C  F
0   1   80  1
1   1   80  2
2   1   80  3
3   1   80  4
4   2  160  5
5   2  160  7
6   2  160  2
7   3  240  9
8   3  240  6
9   3  240  2
10  3  240  5
11  3  240  7
12  4   80  4
13  4   80  0
14  4   80  8
15  4   80  3
16  4   80  2
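With one element per row, the membership test becomes an ordinary vectorised comparison on an integer column. A minimal sketch, assuming that 'I' uniquely identifies each original row (true for the sample data, but an assumption in general); the names `flat` and `matching_ids` are illustrative:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                   'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# Rebuild the flat layout: repeat I and C once per tuple element.
lens = [len(t) for t in df['F']]
flat = pd.DataFrame({
    'I': df['I'].values.repeat(lens),
    'C': df['C'].values.repeat(lens),
    'F': list(chain.from_iterable(df['F'])),
})

# Vectorised comparison on a plain integer column, no Python-level loop.
matching_ids = flat.loc[flat['F'].eq(3), 'I'].unique()

# All elements that belong to an original row containing a 3.
result = flat[flat['I'].isin(matching_ids)]
print(matching_ids)
print(result)
```

The trade-off is that 'I' and 'C' are stored once per element rather than once per tuple.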

Answer 2

You should use the in operator in combination with the apply method, passing a lambda expression.

df[df['F'].apply(lambda x: 3 in x)]

Output

   I   C                F
0  1  80     (1, 2, 3, 4)
3  4  80  (4, 0, 8, 3, 2)

Answer 3

Is there a high-performance, low-memory way to do this?

No, there isn't. A series of tuples is not vectorised. It consists of a double layer of pointers, which is not suited to Pandas / NumPy. You can use hacks such as the str accessor or a list comprehension. Or you can even attempt to expand the tuples into a dataframe:

# axis must be passed by keyword; any(1) is rejected by pandas 2.x
mask = pd.DataFrame(df['F'].values.tolist()).eq(3).any(axis=1)

print(mask)

0     True
1    False
2    False
3     True
dtype: bool

But all of these are expensive. To improve performance, you should look to improve how the data is structured before the series is constructed.
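If the tuples cannot be avoided, one way to keep the temporary memory small is to stream the membership test straight into a NumPy boolean array instead of building a Python list first. This is a sketch of an alternative that none of the answers use; it still loops in Python, so it should not be expected to run faster than the list comprehension.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                   'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# The generator yields one bool at a time; np.fromiter writes each into a
# preallocated one-byte-per-row array (count lets it allocate up front).
mask = np.fromiter((3 in t for t in df['F']), dtype=bool, count=len(df))

result = df[mask]
print(result)
```

A Python list of booleans holds one pointer per row, while the boolean array holds one byte per row, which may matter at 38 million rows.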

Answer 4

A simple apply inside a loc will do the trick:

df.loc[df.F.apply(lambda t : 3 in t)]


   I   C                F
0  1  80     (1, 2, 3, 4)
3  4  80  (4, 0, 8, 3, 2)

4 answers

Answer 1: score 5, accepted, 2018-10-22 12:01:30
Answer 2: score 1, 2018-10-22 12:01:46
Answer 3: score 1, 2018-10-22 12:05:18
Answer 4: score 0, 2018-10-22 12:04:11
