
pandas dataframe str.contains() AND operation

I have a df (Pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The function df.col_name.str.contains("apple|banana") will catch all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator to the str.contains() method so that it only grabs strings that contain BOTH "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
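The boolean Series above can be used directly as a row filter. The sketch below shows that step and also guards against missing values, since `word in sentence` raises a TypeError when a cell holds None or NaN. The helper name `has_all` and the extra `None` row are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           None]})  # hypothetical missing value, added for illustration

targets = ['apple', 'banana']

def has_all(sentence, words=targets):
    # Non-string cells (None/NaN) are treated as "no match" instead of raising.
    return isinstance(sentence, str) and all(w in sentence for w in words)

mask = df['col'].apply(has_all)   # boolean Series, one entry per row
result = df[mask]                 # keep only rows that contain every target word
print(result)
```

Because this uses plain substring checks rather than regular expressions, characters such as `.` or `+` in a target word are matched literally.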

You can also do it in regex style:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

This will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

Then you can apply the pattern dynamically.
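A minimal sketch of that dynamic step, assuming the column is named `col_name` as in the question. The function name `build_all_words_pattern` is hypothetical, and the use of `re.escape` is an addition not present in the original answer; it keeps words that contain regex metacharacters from being interpreted as patterns.

```python
import re
import pandas as pd

def build_all_words_pattern(words):
    # One lookahead per word, anchored at the start of the string;
    # re.escape makes each word match literally.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

pattern = build_all_words_pattern(['apple', 'banana'])
matches = df[df['col_name'].str.contains(pattern)]
print(pattern)
print(matches)
```

Each lookahead asserts that its word appears somewhere in the string without consuming characters, so the order of the words in the sentence does not matter.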

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)

If you only want to use native methods and avoid writing regexes, here is a vectorized version with no lambdas involved:

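import numpy as np

targets = ['apple', 'banana', 'strawberry']
# Build a list, not a generator: recent NumPy versions reject generators in vstack.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]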
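As a variation on the same idea, the per-word masks can be combined with `np.logical_and.reduce` instead of stacking them. This is only a sketch: the four-row sample frame is an assumption chosen so that one row contains all three words, and `regex=False` is added so that each target is matched as a literal substring.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

targets = ['apple', 'banana', 'strawberry']

# One boolean mask per target word (literal substring match).
masks = [df['col'].str.contains(t, regex=False) for t in targets]

# Element-wise AND across all masks; the result is a 1-D boolean array.
combined_mask = np.logical_and.reduce(masks)
result = df[combined_mask]
print(result)
```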

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output:

   ID                           String_Col
2   3  apple and banana both are delicious

If you want to catch at least two of the words in a sentence, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

Output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch and they are separated by a comma ",", add the comma to connector_list and change the second condition from all to any:

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

Output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious

Enumerating every combination for a large list is cumbersome. A better way is to use reduce() with the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                       "banana is delicious",
                       "apple and banana both are delicious",
                       "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we want to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # required on Python 3, where reduce is no longer a builtin
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
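If this filter is applied often, the reduce step can be wrapped in a small helper. The sketch below is one way to do that under stated assumptions: the function name `contains_all` is hypothetical, `operator.and_` stands in for the lambda, and the `regex=False` and `na=False` arguments are additions so that targets match literally and missing values count as non-matches.

```python
from functools import reduce
import operator
import pandas as pd

def contains_all(series, words):
    # AND together one literal-substring mask per word; NaN cells become False.
    masks = (series.str.contains(w, regex=False, na=False) for w in words)
    return reduce(operator.and_, masks)

df = pd.DataFrame({'col': ["apple is delicious",
                           None,  # hypothetical missing value, added for illustration
                           "i love apple, banana, and strawberry"]})

targets = ['apple', 'banana', 'strawberry']
result = df[contains_all(df['col'], targets)]
print(result)
```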

Building on @Anzel's answer, I wrote a function since I'm going to be applying this a lot:

def regify(words, base=str(r'^{}'), expr=str('(?=.*{})')):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

And then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

Then you should get what you're after.
