Subsetting a pandas df by a string in a column of lists

Question

I have a large, complex pandas dataframe with a column X whose values can be either a list or a list of lists. I'm curious whether the solution can apply to any content, though, so my mock example also has one element of X that is a plain string:

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 3],
    'B': ['a', 'e', 'f'],
    'X': ['something', ['hello'], [['something'], ['hello']]],
})

I want to get the subset of that dataframe, df2, for which column X contains the substring "hello" when whatever is in the cell is read as a string.

>>> df2
   A  B                       X
0  1  e                 [hello]
1  3  f  [[something], [hello]]

I have tried extensive combinations of str(), .str.contains, apply, map, .find() and list comprehensions, and nothing seems to work without getting into loops (related questions here and here). What am I missing?

Answer 1

Add astype(str) before str.contains:

df1[df1.X.astype(str).str.contains('hello')]
Out[538]: 
   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]
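
To see why this works, it can help to look at what astype(str) produces for each cell. The following is a minimal sketch that reuses df1 from the question; the names as_text and mask are illustrative only, and regex=False is an optional choice (not part of the answer above) so that the pattern is treated as a literal substring.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 3],
    'B': ['a', 'e', 'f'],
    'X': ['something', ['hello'], [['something'], ['hello']]],
})

# astype(str) replaces each cell with its string form,
# so the list ['hello'] becomes the text "['hello']"
as_text = df1['X'].astype(str)

# regex=False treats 'hello' as a literal substring rather than a regex
mask = as_text.str.contains('hello', regex=False)

df2 = df1[mask]
print(df2)
```

Because the match runs on the text form of the whole cell, a cell such as ['hello world'] would also be selected, which is the substring behaviour the question asks for.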

Answer 2

You can use np.ravel() to flatten the nested list and then use the in operator:

import numpy as np
df1[df1['X'].apply(lambda x: 'hello' in np.ravel(x))]

   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]
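
One thing worth noting about this approach: the in operator on a NumPy array compares whole elements, so it behaves as an exact match rather than a substring match. The sketch below checks a few individual values outside of pandas; the variable names are illustrative, and the inputs are the mock values from the question plus one extra 'hello world' case.

```python
import numpy as np

# A bare string becomes a one-element array, so no character-level match happens
plain_string = 'hello' in np.ravel('something')

# Flat and regularly nested lists are flattened to a 1-D array of strings
flat_list = 'hello' in np.ravel(['hello'])
nested_list = 'hello' in np.ravel([['something'], ['hello']])

# Membership compares whole elements, so a longer string is not a match
longer_string = 'hello' in np.ravel(['hello world'])

print(plain_string, flat_list, nested_list, longer_string)
```

np.ravel first has to convert its input to an array, so irregularly nested lists (sublists of different lengths or depths) may not convert cleanly, depending on the NumPy version.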

Answer 3

Borrowing from @wim https://stackoverflow.com/a/49247980/2336654

The most general solution would be to allow for arbitrarily nested lists. Also, we can focus on the string elements being equal to "hello" rather than containing it.

# This import is for Python 3
# for Python 2 use `from collections import Iterable`
from collections.abc import Iterable

def flatten(collection):
    for x in collection:
        if isinstance(x, Iterable) and not isinstance(x, str):
            yield from flatten(x)
        else:
            yield x

df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]

   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]

So now if we complicate it:

df1 = pd.DataFrame({
    'A': [1, 1, 3, 7, 7], 
    'B': ['a', 'e', 'f', 's', 's'], 
    'X': [
        'something',
        ['hello'],
        [['something'],['hello']],
        ['hello world'],
        [[[[[['hello']]]]]]
    ]}
)

df1

   A  B                       X
0  1  a               something
1  1  e                 [hello]
2  3  f  [[something], [hello]]
3  7  s           [hello world]
4  7  s     [[[[[['hello']]]]]]

Our filter does not grab 'hello world' but does grab the deeply nested 'hello':

df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]

   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]
4  7  s     [[[[[['hello']]]]]]
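
If substring matching at any depth is wanted instead (so that 'hello world' is also selected), the same idea can be adapted. This is only a sketch under that assumption: iter_strings is a hypothetical helper, not part of the answer above, and it treats a top-level string as a single leaf instead of iterating over its characters.

```python
from collections.abc import Iterable

import pandas as pd


def iter_strings(obj):
    """Yield every string found in obj, at any nesting depth."""
    if isinstance(obj, str):
        # Treat a string as a leaf so it is not split into characters
        yield obj
    elif isinstance(obj, Iterable):
        for item in obj:
            yield from iter_strings(item)
    # Values that are neither strings nor iterables are skipped


df1 = pd.DataFrame({
    'A': [1, 1, 3, 7, 7],
    'B': ['a', 'e', 'f', 's', 's'],
    'X': [
        'something',
        ['hello'],
        [['something'], ['hello']],
        ['hello world'],
        [[[[[['hello']]]]]],
    ],
})

# 'hello' in s is a substring test on each string leaf
mask = df1['X'].map(lambda x: any('hello' in s for s in iter_strings(x)))
print(df1[mask])
```

With this variant, every row except the first ('something') is kept.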
