pandas dataframe str.contains() AND operation
I have a df (Pandas DataFrame) with three rows:
some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"
The function df.col_name.str.contains("apple|banana")
will catch all of the rows:
"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".
How do I apply the AND operator to the str.contains() method,
so that it only grabs strings that contain BOTH "apple" and "banana"?
"apple and banana both are delicious"
I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.)
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
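A self-contained sketch of the same idea, in case it helps to see it run. The names has_apple and has_banana are made up for illustration, and regex=False is an optional choice here so that each word is treated as a literal substring rather than a pattern:

```python
import pandas as pd

df = pd.DataFrame({"col_name": ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

# Each str.contains() call returns a boolean Series;
# `&` combines the two Series element-wise (logical AND per row).
has_apple = df["col_name"].str.contains("apple", regex=False)
has_banana = df["col_name"].str.contains("banana", regex=False)

both = df[has_apple & has_banana]
print(both)
```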
import pandas as pd
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})
targets = ['apple', 'banana']
# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0 True
1 True
2 True
Name: col, dtype: bool
# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0 False
1 False
2 True
Name: col, dtype: bool
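One thing to watch with the apply approach: `word in sentence` raises a TypeError when a cell is NaN or None. A minimal sketch of one way around that, assuming missing values should simply count as non-matching (contains_all is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple is delicious",
                           None,
                           "apple and banana both are delicious"]})
targets = ["apple", "banana"]

def contains_all(sentence, words=targets):
    # Non-string cells (None/NaN) are treated as "no match" instead of raising.
    return isinstance(sentence, str) and all(word in sentence for word in words)

mask = df["col"].apply(contains_all)
print(df[mask])
```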
You can also do it in regex style:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
which will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
Then you can do your stuff dynamically.
This works:
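If the words might contain regex metacharacters (".", "+", "(" and so on), the pattern above would interpret them as regex syntax. A hedged variation that escapes each word first with re.escape; build_all_words_pattern is a hypothetical name used only for this sketch:

```python
import re
import pandas as pd

def build_all_words_pattern(words):
    # One lookahead per word; re.escape makes each word match literally.
    return "^" + "".join("(?=.*{})".format(re.escape(w)) for w in words)

pattern = build_all_words_pattern(["apple", "banana"])

df = pd.DataFrame({"col_name": ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})
mask = df["col_name"].str.contains(pattern)

# With escaping, "a.b" only matches a literal dot, not any character.
dot_mask = pd.Series(["a.b here", "axb here"]).str.contains(
    build_all_words_pattern(["a.b"]))
```

Note that "." does not match newlines by default, so multi-line strings would need flags=re.DOTALL passed to str.contains.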
df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)
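If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:
import numpy as np
targets = ['apple', 'banana', 'strawberry']
# np.vstack expects a sequence of arrays, so build a list rather than a generator
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]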
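A pandas-only variation on the same idea, as a sketch: put the per-word masks side by side with pd.concat and require every column to be True. This keeps the original index on the resulting mask. The sample frame below is assumed for illustration:

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})
targets = ["apple", "banana", "strawberry"]

# One boolean column per target word, then AND across the columns row by row.
masks = [df["col"].str.contains(t, regex=False) for t in targets]
combined_mask = pd.concat(masks, axis=1).all(axis=1)

result = df[combined_mask]
print(result)
```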
Try this regex:
apple.*banana|banana.*apple
The code is:
import pandas as pd
df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))
print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])
Output:
ID String_Col
2 3 apple and banana both are delicious
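This ordering-based pattern needs one alternative per possible word order, which is why it gets unwieldy for longer lists. A small sketch that generates it for any list makes the growth visible; ordered_alternation is a hypothetical helper, not part of pandas:

```python
import re
from itertools import permutations

def ordered_alternation(words):
    # Join every ordering of the words with ".*", then OR the orderings together.
    return "|".join(
        ".*".join(re.escape(w) for w in ordering)
        for ordering in permutations(words)
    )

pattern = ordered_alternation(["apple", "banana"])
three = ordered_alternation(["apple", "banana", "cat"])

print(pattern)
print(len(three.split("|")))  # number of alternatives for three words
```

Two words give 2 alternatives and three words give 6 (n! in general), so for 10-20 words the lookahead or mask-combining approaches are the practical choice.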
If you want to catch at least two of the words in the sentence, maybe this will work (taking the tip from @Alexander):
target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
If you have more than two words to catch and they are separated by a comma ',', add the comma to connector_list and change the second condition from all to any:
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]
Output:
col
2 apple and banana both are delicious
3 orange,banana and apple all are delicious
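The connector check above relies on "and" or "," appearing in the sentence. If the goal is literally "at least two of the target words", another way to sketch it is to count the hits directly. This is an alternative technique, not the method from the answer above, and count_hits is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({"col": ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})
target = ["apple", "banana", "grapes", "orange"]

def count_hits(sentence):
    # Number of target words that occur as substrings of the sentence.
    return sum(word in sentence for word in target)

mask = df["col"].apply(count_hits) >= 2
print(df[mask])
```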
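Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).
For example, consider the following DataFrame: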
df = pd.DataFrame({'col': ["apple is delicious",
"banana is delicious",
"apple and banana both are delicious",
"i love apple, banana, and strawberry"]})
# col
#0 apple is delicious
#1 banana is delicious
#2 apple and banana both are delicious
#3 i love apple, banana, and strawberry
Suppose we want to search for all of the following:
targets = ['apple', 'banana', 'strawberry']
We can do:
from functools import reduce  # reduce is not a builtin in Python 3
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])
# col
#3 i love apple, banana, and strawberry
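Keep in mind that all of these substring checks also match inside longer words, so "apple" matches "pineapple". If whole-word matching is wanted, one possible sketch is to wrap each escaped word in \b word boundaries before reducing. The sample data here is made up for illustration:

```python
import operator
import re
from functools import reduce

import pandas as pd

df = pd.DataFrame({"col": ["pineapple and banana",
                           "apple, banana"]})
targets = ["apple", "banana"]

# \b anchors each word at word boundaries; re.escape keeps the word literal.
masks = (df["col"].str.contains(r"\b{}\b".format(re.escape(t))) for t in targets)

# operator.and_ is the function form of `&`, applied pairwise across the masks.
mask = reduce(operator.and_, masks)
print(df[mask])
```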
Based on @Anzel's answer, I wrote a function, since I'm going to be applying this a lot:
def regify(words, base=str(r'^{}'), expr=str('(?=.*{})')):
    return base.format(''.join(expr.format(w) for w in words))
So if you have words defined:
words = ['apple', 'banana']
then call it with something like:
dg = df.loc[
df['col_name'].str.contains(regify(words), case=False, regex=True)
]
Then you should get what you're after.