Check whether a Pandas DataFrame string column contains all elements given in an array

Question

I have a dataframe as shown below:

>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
           a  b  c  d
0       app;  1  2  3
1  app; web;  4  5  6
2       web;  7  8  9
3             1  4  5

I have an input array that looks like this: ["app","web"]. I want to check each of these values against a specific column of the dataframe and get a boolean result per row, as shown below:

>>> df.a.str.contains("app")
0     True
1     True
2    False
3    False

Since str.contains only lets me look for a single value, I was wondering whether there is a more direct way to do the same for several values, something like:

 df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'

My end goal is not an exact match (df.a.isin(["app", "web"])) but a 'contains' logic that returns True whenever those substrings are present anywhere in that cell of the dataframe.

Note: I can of course use the apply method with my own function for the same logic, such as:

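elementsToLookFor = ["app","web"]
# header is the name of the column to check, e.g. 'a'.
# Apply to that single column so each element is one cell's string.
result = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))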

But I am more interested in the most efficient way to do this, so I would prefer a native pandas function, or failing that the next most optimized custom solution.

Answer 1

This should work too:

l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))

This should work as well:

pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
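As a runnable sketch of the second approach on the question's sample data: the variant below passes regex=False so that each value is matched as a literal substring (an assumption about intent; the line above uses the default regex matching), and it combines the per-value masks with NumPy's logical_and.reduce rather than pd.concat. The names masks and contains_all are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[['app;', 1, 2, 3], ['app; web;', 4, 5, 6], ['web;', 7, 8, 9], ['', 1, 4, 5]],
    columns=['a', 'b', 'c', 'd'],
)
l = ["app", "web"]

# One boolean mask per value; regex=False treats each value as plain text
masks = [df['a'].str.contains(i, regex=False) for i in l]

# A row qualifies only if every mask is True for it
contains_all = pd.Series(np.logical_and.reduce(masks), index=df.index)
print(contains_all.tolist())  # [False, True, False, False]
```

Only row 1 ('app; web;') contains both values, which matches the output shown in Answer 2.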

Answer 2

Try with str.get_dummies:

df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0    False
1     True
2    False
3    False
dtype: bool
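To see why this works, it can help to look at the intermediate indicator table. The sketch below splits each cell on ';' into tokens and builds one 0/1 column per distinct token. Two points are worth keeping in mind: this compares whole tokens rather than substrings, and selecting a value that never occurs in the column would raise a KeyError. The reindex call is an added guard against the latter, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['app;', 'app; web;', 'web;', '']})
values = ['web', 'app']

# Remove spaces, then split on ';' into one indicator column per token
dummies = df['a'].str.replace(' ', '', regex=False).str.get_dummies(';')
print(dummies)

# reindex adds an all-zero column for any value missing from the data
has_all = dummies.reindex(columns=values, fill_value=0).astype(bool).all(axis=1)
print(has_all.tolist())  # [False, True, False, False]
```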

Update

df['a'].str.contains(r'^(?=.*web)(?=.*app)')
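The pattern above uses one lookahead per value: each (?=.*value) asserts that the value occurs somewhere in the string without consuming any characters, so every assertion has to hold for the same cell. A minimal sketch that builds such a pattern from a list follows. The helper name build_all_pattern and the use of re.escape are illustrative additions, not part of the original answer.

```python
import re
import pandas as pd

def build_all_pattern(values):
    """Return a regex that matches only strings containing every value."""
    # re.escape keeps characters such as '.' or '+' from acting as regex syntax
    return '^' + ''.join(f'(?=.*{re.escape(v)})' for v in values)

df = pd.DataFrame({'a': ['app;', 'app; web;', 'web;', '']})
pattern = build_all_pattern(['web', 'app'])
result = df['a'].str.contains(pattern)

print(pattern)          # ^(?=.*web)(?=.*app)
print(result.tolist())  # [False, True, False, False]
```

Escaping matters once the values can contain regex metacharacters; for plain words such as 'app' and 'web' the generated pattern is identical to the hand-written one.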

Update 2 (to make the match case-insensitive and to ensure the column dtype is str, without which the logic may fail):

elementList = ['app','web']
valueString = ''  # must be initialized before appending to it
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
# Cast to str and lowercase so the match is case-insensitive
df[header] = df[header].astype(str).str.lower()
result = df[header].str.contains(valueString)

Answer 3

So many solutions; which one is the most efficient?

The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller DataFrames:

values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
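The answer does not show how the timings were taken. One possible way to compare the candidates is sketched below with the standard-library timeit module. The test frame (the question's sample column repeated 1,000 times), the run count, and the names base, big, candidates and timings are all assumptions for illustration, and the resulting numbers will vary by machine and pandas version. Only three of the functions are repeated here to keep the sketch short.

```python
import timeit
import pandas as pd

values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(v) for v in values], axis=1).all(axis=1)

def contains(df):
    return df.a.str.contains(pattern)

base = pd.DataFrame({'a': ['app;', 'app; web;', 'web;', '']})
big = pd.concat([base] * 1000, ignore_index=True)  # 4,000 rows; size is arbitrary

candidates = [findall_map, contains_concat_all, contains]
expected = contains(big).tolist()
timings = {}
for func in candidates:
    # Check agreement first so that only equivalent solutions are compared
    assert func(big).tolist() == expected, func.__name__
    timings[func.__name__] = timeit.timeit(lambda: func(big), number=5)

# Report from fastest to slowest
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds:.4f} s for 5 runs')
```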

Problem description

3 solutions

Solution 1
2 2021-07-09 03:24:23

Solution 2
1 accepted 2021-07-08 03:28:10

Solution 3
1 2021-07-11 06:42:48
