After searching for a while, I can't find an answer to what must be a common issue, so pointers are welcome.
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5], 'C' : [['a','b'],['b','c'] ,['g','h'],['x','y']]})
and I want to select a subset of its rows: those whose list in the 'C' column contains at least one value from a list of things I'm interested in, e.g.
listOfInterestingThings = ['a', 'g']
so that when the filter is applied I would have df1:
df1 =
A  B  C
5  1  ['a','b']
3  3  ['g','h']
The dataframe I'm dealing with is a massive raw data import that takes about 12 GB of RAM in its current df form, and about half that on disk as a series of JSON files.
I fully agree with @DSM.
As a last resort you can use this:
In [21]: df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index) \
.isin(listOfInterestingThings).any(axis=1)]
Out[21]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]
or:
In [11]: listOfInterestingThings = set(['a', 'g'])
In [12]: df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings) > 0)]
Out[12]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]
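A small variation on the apply approach, offered as a sketch rather than a benchmarked recommendation: `set.isdisjoint` answers the same "anything in common?" question without building the intersection set. The names `interesting` and `has_interesting` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})

# Hypothetical name; a set gives constant-time membership checks.
interesting = {'a', 'g'}

# isdisjoint() is True when the list shares nothing with the set,
# so negate it to keep rows containing at least one interesting value.
has_interesting = df['C'].apply(lambda items: not interesting.isdisjoint(items))

df1 = df.loc[has_interesting]
print(df1)
```

Like the set-intersection version, this still calls a Python function once per row, so it is unlikely to change the overall picture on very large frames.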
Explanation:
In [22]: pd.DataFrame(df.C.values.tolist(), index=df.index)
Out[22]:
   0  1
0  a  b
1  b  c
2  g  h
3  x  y
In [23]: pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings)
Out[23]:
       0      1
0   True  False
1  False  False
2   True  False
3  False  False
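To finish the walkthrough: the remaining step is to collapse the per-position booleans into one flag per row with `any(axis=1)` and use that as the row mask. This is a sketch that assumes, as in the example, that the lists are short enough to expand into columns; `expanded`, `hits` and `mask` are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})
listOfInterestingThings = ['a', 'g']

# One column per list position; shorter lists would be padded
# with missing values.
expanded = pd.DataFrame(df['C'].tolist(), index=df.index)

# True wherever a list element is one of the interesting values.
hits = expanded.isin(listOfInterestingThings)

# A row qualifies if any of its positions matched.
mask = hits.any(axis=1)

print(df.loc[mask])
```

Passing `axis=1` by keyword rather than as a bare `1` keeps the call valid on pandas versions where the arguments of `any` are keyword-only.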
This also works, using a plain list comprehension as the boolean mask (no np.any wrapper is needed):
df[[('a' in i) or ('g' in i) for i in df.C.values]]
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]
Benchmarks:
time df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings)> 0)]
CPU times: user 873 µs, sys: 193 µs, total: 1.07 ms
Wall time: 987 µs
time df[list(np.any(('a' in i) | ('g' in i) for i in df.C.values))]
CPU times: user 1.02 ms, sys: 224 µs, total: 1.24 ms
Wall time: 1.08 ms
time df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings).any(1)]
CPU times: user 2.58 ms, sys: 1.01 ms, total: 3.59 ms
Wall time: 5.41 ms
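So, in short, on this four-row example @MaxU's apply with set intersection is the quickest of the three, and the expand-then-isin variant is the slowest.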
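Timings on a four-row frame are dominated by fixed overhead, so the ranking may not carry over to the 12 GB case. Below is a rough sketch for re-running the comparison on a larger synthetic frame. The row count, the random two-letter lists, and the helper names are all assumptions made for illustration, and no timings are claimed here.

```python
import random
import string
import timeit

import pandas as pd

random.seed(0)
n_rows = 50_000  # assumed size; adjust to taste

# Synthetic stand-in for the real data: two random letters per row.
df = pd.DataFrame({
    'C': [random.sample(string.ascii_lowercase, 2) for _ in range(n_rows)],
})
interesting = {'a', 'g'}


def mask_apply():
    # Set-intersection test applied row by row.
    return df['C'].apply(lambda x: len(set(x) & interesting) > 0)


def mask_comprehension():
    # Plain Python membership tests collected into a Series.
    return pd.Series([('a' in x) or ('g' in x) for x in df['C']],
                     index=df.index)


def mask_expand_isin():
    # Expand the lists into columns, then use vectorised isin.
    expanded = pd.DataFrame(df['C'].tolist(), index=df.index)
    return expanded.isin(list(interesting)).any(axis=1)


# All three should select exactly the same rows.
assert mask_apply().tolist() == mask_comprehension().tolist()
assert mask_apply().tolist() == mask_expand_isin().tolist()

for fn in (mask_apply, mask_comprehension, mask_expand_isin):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per run")
```

Measuring on data shaped like your own (list lengths, number of distinct values) is the only reliable way to pick between the approaches.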