Pandas DataFrame str.contains() AND operation

Question

I have a df (Pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The call df.col_name.str.contains("apple|banana") will catch all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains() method so that it only grabs strings that contain both "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).

Answer 1

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
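The same two-mask idea can be sketched a little more defensively. This version assumes the column may contain missing values and that the search terms are literal text rather than regular expressions; `regex=False` and `na=False` are existing `str.contains()` parameters, and the extra `None` row is added here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious",
                                None]})  # illustrative missing value

# One boolean mask per word; missing values count as "no match".
has_apple = df['col_name'].str.contains('apple', regex=False, na=False)
has_banana = df['col_name'].str.contains('banana', regex=False, na=False)

# Element-wise AND keeps only the rows where both masks are True.
result = df[has_apple & has_banana]
print(result)
```

Without `na=False`, a missing value would propagate into the mask, and indexing with a mask that contains missing values can fail.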

Answer 2

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
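Note that `word in sentence` is a substring test, so "pineapple" would count as containing "apple". If whole-word matching is what is wanted (an assumption, the question does not say), one possible sketch is to tokenize each sentence and compare sets. The helper name `has_all_words` and the sample rows are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({'col': ["pineapple and banana are delicious",
                           "apple, banana: both are delicious"]})
targets = {'apple', 'banana'}

def has_all_words(sentence):
    # Split into lowercase word tokens, ignoring punctuation.
    tokens = set(re.findall(r'\w+', sentence.lower()))
    # True only if every target appears as a complete token.
    return targets <= tokens

mask = df['col'].apply(has_all_words)
print(df[mask])
```

The first row is rejected because "pineapple" is a different token from "apple", even though it contains it as a substring.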

Answer 3

You can also do it in regular-expression style:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

This renders:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

You can then build the pattern dynamically for whatever word list you have.
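If the words may contain regex metacharacters (for example a period or parentheses), the pattern built this way could match unintended text. A minimal sketch of one way to guard against that, assuming the words are meant literally, is to pass each one through `re.escape` first. The function name `build_and_pattern` is hypothetical.

```python
import re
import pandas as pd

def build_and_pattern(words):
    # One lookahead per word; re.escape makes each word match literally.
    lookaheads = ''.join('(?=.*{})'.format(re.escape(w)) for w in words)
    return '^' + lookaheads

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

pattern = build_and_pattern(['apple', 'banana'])
matches = df[df['col_name'].str.contains(pattern)]
print(pattern)
print(matches)
```

Each `(?=.*word)` is a zero-width lookahead anchored at the start of the string, so all of them must succeed for the row to match, regardless of the order in which the words appear.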

Answer 4

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)', regex=True)

Answer 5

If you only want to use native methods and avoid writing regexes, here is a vectorized version with no lambdas involved:

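import numpy as np

targets = ['apple', 'banana', 'strawberry']
# One boolean mask per target; a list is used because np.vstack expects a sequence, not a generator.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]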
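A pandas-only variant of the same stacking idea, sketched here under the assumption that the targets are literal strings: the masks are placed side by side with `pd.concat` and reduced row-wise with `.all(axis=1)`, which keeps the result as an indexed Series instead of a bare NumPy array. The four-row sample frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})
targets = ['apple', 'banana', 'strawberry']

# One boolean column per target word, labelled by the word itself.
masks = [df['col'].str.contains(t, regex=False) for t in targets]
mask_table = pd.concat(masks, axis=1, keys=targets)

# A row qualifies only if every column is True.
combined_mask = mask_table.all(axis=1)
result = df[combined_mask]
print(result)
```

Keeping `mask_table` around can also help with debugging, since it shows which individual words each row matched.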

Answer 6

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output:

   ID                           String_Col
2   3  apple and banana both are delicious
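This pattern lists every possible ordering of the words, so it grows factorially with the number of words. The sketch below generates the alternation for an arbitrary list to make that growth visible; the helper name `ordered_alternation` is hypothetical, and the words are assumed to be literal text.

```python
import re
from itertools import permutations

def ordered_alternation(words):
    # Join each ordering with '.*', then join all orderings with '|'.
    return '|'.join(
        '.*'.join(re.escape(w) for w in order)
        for order in permutations(words)
    )

pattern = ordered_alternation(['apple', 'banana'])
print(pattern)

# Four words already need 4! = 24 alternatives.
four_word_pattern = ordered_alternation(['apple', 'banana', 'grape', 'orange'])
n_alternatives = len(four_word_pattern.split('|'))
print(n_alternatives)
```

For the 10-20 words mentioned in the question this approach is impractical, since 10 words alone would require 3,628,800 alternatives.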

Answer 7

If you want to catch at least two of the words in a sentence, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

Output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch and they are separated by a comma ",", add the comma to connector_list and change the second condition from all to any:

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

Output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious
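If the actual goal is "at least two of the target words appear", a sketch that counts matches directly may be simpler than relying on connector words. This is an alternative to the connector approach above, not a reproduction of it, and the four-row sample frame is assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})
target = ['apple', 'banana', 'grapes', 'orange']

# Count how many target words occur (as substrings) in each sentence.
counts = df['col'].apply(lambda sentence: sum(word in sentence for word in target))

# Keep the rows that mention two or more of the targets.
result = df[counts >= 2]
print(result)
```

The threshold `2` can be changed to `len(target)` to require every word, which reduces to the `all(...)` form shown in an earlier answer.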

Answer 8

Enumerating all the possibilities for a large list is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                       "banana is delicious",
                       "apple and banana both are delicious",
                       "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we want to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do this:

from functools import reduce  # required on Python 3, where reduce is no longer a builtin
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
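One edge case worth noting: `reduce()` raises a `TypeError` when the iterable is empty and no initial value is given. A possible way to wrap the approach in a reusable function, assuming an empty target list should match every row, is to seed the reduction with an all-True mask. The function name `contains_all` is hypothetical.

```python
from functools import reduce
import operator
import pandas as pd

def contains_all(series, targets):
    # Seed mask: with no targets, every row trivially qualifies.
    all_true = pd.Series(True, index=series.index)
    masks = (series.str.contains(t, regex=False) for t in targets)
    # operator.and_ is the function form of the & operator.
    return reduce(operator.and_, masks, all_true)

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

mask = contains_all(df['col'], ['apple', 'banana', 'strawberry'])
print(df[mask])
```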

Answer 9

Building on @Anzel's answer, I wrote a function, since I'm going to be applying this a lot:

def regify(words, base=r'^{}', expr='(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

and then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

then you should get what you're after.

9 solutions

Solution 1  41  accepted  2016-05-03 18:35:41
Solution 2  29  2016-05-03 19:57:46
Solution 3  28  2016-05-03 18:42:22
Solution 4  9  2018-07-25 18:38:26
Solution 5  5  2019-06-19 09:39:40
Solution 6  4  2016-05-03 18:54:50
Solution 7  3  2016-05-04 20:07:16
Solution 8  3  2018-03-12 14:05:27
Solution 9  0  2021-12-12 14:08:38
