Check if value in tuple of dataframe
I have a huge dataframe (38 million rows):
df = pd.DataFrame({'I':[1,2,3,4], 'C':[80,160,240,80],
'F':[(1,2,3,4),(5,7,2),(9,6,2,5,7),(4,0,8,3,2)]})
C F I
0 80 (1, 2, 3, 4) 1
1 160 (5, 7, 2) 2
2 240 (9, 6, 2, 5, 7) 3
3 80 (4, 0, 8, 3, 2) 4
Now I would like to select only the rows whose tuple in 'F' contains the number 3.
To give:
C F I
0 80 (1, 2, 3, 4) 1
3 80 (4, 0, 8, 3, 2) 4
Is there a high-performance, low-memory way to do this?
I tried np.equal((3), df['F'].values).all() but this obviously does not work.
Use in with a list comprehension if performance is important:
df = df[[3 in x for x in df['F']]]
Or:
df = df[df['F'].apply(set) >= set([3])]
print (df)
I C F
0 1 80 (1, 2, 3, 4)
3 4 80 (4, 0, 8, 3, 2)
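Since the question also asks about memory, here is a minimal sketch of a variant of the list comprehension that writes the mask straight into a NumPy boolean array with np.fromiter. On a typical 64-bit CPython build a list holds one 8-byte pointer per element, while a bool array stores one byte per row, so this may help at 38 million rows. This is my own variation, not something I have measured on data of that size.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                   'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# Feed a generator to np.fromiter so no intermediate Python list is built;
# count=len(df) lets NumPy allocate the output array once.
mask = np.fromiter((3 in x for x in df['F']), dtype=bool, count=len(df))

result = df[mask]
print(result)
```

The membership test itself still runs in Python for every row, so the speed should be comparable to the plain list comprehension; only the mask storage changes.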
Performance (this depends on the number of matched values and on the length of df, too):
#[40000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [166]: %timeit df[[3 in x for x in df['F']]]
5.57 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [167]: %timeit df[df['F'].apply(lambda x: 3 in x)]
12.2 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [170]: %timeit df[df['F'].apply(set) >= set([3])]
29 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [171]: %timeit df[pd.DataFrame(df['F'].values.tolist()).eq(3).any(axis=1)]
37.4 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
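If you want to repeat this comparison outside IPython, a rough sketch with the standard timeit module could look like the one below. The names base, big, candidates, results and timings are just illustrative, and the numbers will differ from the ones above depending on your machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                     'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})
big = pd.concat([base] * 10000, ignore_index=True)  # 40000 rows x 3 columns

# Each candidate returns the filtered dataframe.
candidates = {
    'list comprehension': lambda: big[[3 in x for x in big['F']]],
    'apply + lambda': lambda: big[big['F'].apply(lambda x: 3 in x)],
    'expand to dataframe': lambda: big[pd.DataFrame(big['F'].tolist()).eq(3).any(axis=1)],
}

# Run every candidate once to confirm they all select the same rows.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats, 3 calls per repeat, reported per call.
timings = {name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```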
A better structure, as @jpp pointed out, is to create one row per tuple element:
from itertools import chain
lens = df['F'].str.len()
df = pd.DataFrame({
'I' : df['I'].values.repeat(lens),
'C' : df['C'].values.repeat(lens),
'F' : list(chain.from_iterable(df['F'].tolist()))
})
print (df)
I C F
0 1 80 1
1 1 80 2
2 1 80 3
3 1 80 4
4 2 160 5
5 2 160 7
6 2 160 2
7 3 240 9
8 3 240 6
9 3 240 2
10 3 240 5
11 3 240 7
12 4 80 4
13 4 80 0
14 4 80 8
15 4 80 3
16 4 80 2
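With one value per row, the lookup becomes an ordinary vectorised comparison. A minimal sketch, assuming that 'I' uniquely identifies each original row (true for the sample data, but an assumption in general). It builds the long frame with DataFrame.explode (available from pandas 0.25) instead of repeat and chain; the names wide, long_df and ids_with_3 are only illustrative.

```python
import pandas as pd

wide = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                     'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# One row per tuple element; explode leaves 'F' as object dtype,
# so cast it to a proper integer column afterwards.
long_df = wide.explode('F').reset_index(drop=True)
long_df['F'] = long_df['F'].astype('int64')

# Vectorised comparison: which original rows (by 'I') contain a 3?
ids_with_3 = long_df.loc[long_df['F'].eq(3), 'I'].unique()

# Map back to the original wide rows if the tuples are still needed.
filtered = wide[wide['I'].isin(ids_with_3)]
print(filtered)
```

If the tuples are not needed afterwards, you can keep only the long frame and skip the last step.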
You should use the in operator in combination with the apply method, passing a lambda expression.
df[df['F'].apply(lambda x: 3 in x)]
Output
I C F
0 1 80 (1, 2, 3, 4)
3 4 80 (4, 0, 8, 3, 2)
Is there a high performant, low memory usage way to do this?
No, there isn't. A series of tuples is not vectorised. It consists of a double layer of pointers, which is not suited to Pandas / NumPy. You can use hacks such as the str accessor or a list comprehension. Or, even, attempt to expand into a dataframe:
mask = pd.DataFrame(df['F'].values.tolist()).eq(3).any(axis=1)
print(mask)
0 True
1 False
2 False
3 True
dtype: bool
But all of these are expensive. To improve performance, you should look to improve how data is structured before the series is constructed.
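One possible restructuring, offered only as a sketch: store all values in a single flat integer array plus the start offset of each row, and reduce per row with np.logical_or.reduceat. This assumes every tuple is non-empty (reduceat does not handle empty segments the way you would want) and that the values fit in int64. Ideally flat and offsets would be built when the data is loaded; here they are derived from the tuples only to keep the example self-contained, and the names lengths, flat and offsets are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'I': [1, 2, 3, 4], 'C': [80, 160, 240, 80],
                   'F': [(1, 2, 3, 4), (5, 7, 2), (9, 6, 2, 5, 7), (4, 0, 8, 3, 2)]})

# Length of each tuple and the position where each row starts in the flat array.
lengths = df['F'].str.len().to_numpy()
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

# All tuple elements in one contiguous int64 array.
flat = np.fromiter((v for t in df['F'] for v in t),
                   dtype=np.int64, count=int(lengths.sum()))

# One vectorised comparison, then a logical OR over each row's segment.
mask = np.logical_or.reduceat(flat == 3, offsets)
print(df[mask])
```

Once flat and offsets exist, each new lookup is a single pass over contiguous memory rather than a Python-level loop over tuples.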
A simple apply inside loc will do the trick:
df.loc[df.F.apply(lambda t : 3 in t)]
I C F
0 1 80 (1, 2, 3, 4)
3 4 80 (4, 0, 8, 3, 2)