Check if a pandas DataFrame string column contains all the elements given in an array
I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
           a  b  c  d
0       app;  1  2  3
1  app; web;  4  5  6
2       web;  7  8  9
3             1  4  5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of the dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to do the same, something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not an exact match (df.a.isin(["app", "web"])) but rather a 'contains' logic that returns True whenever those substrings are present anywhere in that cell of the dataframe.
Note: I can of course use the apply method with my own function for the same logic, such as:
elementsToLookFor = ["app","web"]
df[header] = df['a'].apply(lambda cell: all(a in cell for a in elementsToLookFor))
But I am more interested in the optimal algorithm for this, so I would prefer a native pandas function, or failing that the next most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
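To spell out why the findall approach works: the alternation 'app|web' collects every occurrence of either value in a cell, and the cell qualifies only when the distinct matches cover the whole list. A minimal sketch on the question's sample column follows; the re.escape call and the set comparison are my additions, and it assumes no value is a prefix of another (with "app" and "apple" the alternation would only ever report the first one).

```python
import re
import pandas as pd

# Sample column copied from the question.
df = pd.DataFrame({"a": ["app;", "app; web;", "web;", ""]})
values = ["app", "web"]

# Escape each value so regex metacharacters are matched literally (an addition).
pattern = "|".join(re.escape(v) for v in values)
wanted = set(values)

# findall yields a list of matches per cell; the set drops repeated matches.
matches = df["a"].str.findall(pattern)
result = matches.map(lambda found: set(found) == wanted)

print(result.tolist())  # [False, True, False, False]
```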
This should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
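A possible variation on the same idea, offered as a sketch rather than a pandas built-in: build one boolean mask per value and AND them together with functools.reduce instead of concatenating. Here regex=False treats each value as a literal substring and na=False maps missing cells to False; both keyword choices are my assumptions about what is wanted.

```python
from functools import reduce
import operator

import pandas as pd

# Sample column copied from the question.
df = pd.DataFrame({"a": ["app;", "app; web;", "web;", ""]})
values = ["app", "web"]

# One boolean Series per value: True where the cell contains that substring.
masks = [df["a"].str.contains(v, regex=False, na=False) for v in values]

# Element-wise AND across all masks: True only where every value is present.
result = reduce(operator.and_, masks)

print(result.tolist())  # [False, True, False, False]
```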
Try with str.get_dummies:
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
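Two things to keep in mind with the get_dummies route: it matches whole ';'-separated tokens rather than substrings, and selecting a value that never occurs in the column raises a KeyError. A small sketch that sidesteps the second point with reindex follows; the extra value "api" is hypothetical and only there to show the missing-column case.

```python
import pandas as pd

# Sample column copied from the question.
df = pd.DataFrame({"a": ["app;", "app; web;", "web;", ""]})

# One 0/1 indicator column per token found between ';' separators.
dummies = df["a"].str.replace(" ", "", regex=False).str.get_dummies(";")

def has_all(values):
    # reindex adds an all-zero column for any value absent from the data,
    # so the lookup cannot fail with a KeyError.
    return dummies.reindex(columns=values, fill_value=0).all(axis=1)

print(has_all(["app", "web"]).tolist())         # [False, True, False, False]
print(has_all(["app", "web", "api"]).tolist())  # [False, False, False, False]
```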
Update:
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
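Each (?=.*value) is a zero-width lookahead, so all of them are tested from the start of the string and the match succeeds only if every value occurs somewhere in the cell, in any order. A sketch of building that pattern from a list follows; re.escape and na=False are my additions. Since '.' does not match a newline by default, cells containing line breaks would presumably need flags=re.DOTALL.

```python
import re
import pandas as pd

# Sample column copied from the question.
df = pd.DataFrame({"a": ["app;", "app; web;", "web;", ""]})
values = ["app", "web"]

# One lookahead per value; escaping keeps metacharacters literal (an addition).
pattern = "^" + "".join(f"(?=.*{re.escape(v)})" for v in values)

result = df["a"].str.contains(pattern, na=False)

print(pattern)          # ^(?=.*app)(?=.*web)
print(result.tolist())  # [False, True, False, False]
```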
Update 2 (to make the match case-insensitive and ensure the column dtype is str, without which the logic may fail):
elementList = ['app','web']
valueString = ''  # must be initialized before appending the lookaheads
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
# Cast the column to str and lowercase it so the match is case-insensitive
df[header] = df[header].astype(str).str.lower()
result = df[header].str.contains(valueString)
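If overwriting the column with its lowercased copy is not desirable, str.contains also accepts case=False and na=False, which should give the same decision without modifying the data. A sketch under that assumption; the mixed-case sample values and the None entry are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical mixed-case data with a missing value, for illustration only.
df = pd.DataFrame({"a": ["App;", "APP; Web;", "web;", None]})
values = ["app", "web"]

pattern = "".join(f"(?=.*{re.escape(v)})" for v in values)

# case=False ignores letter case; na=False reports missing cells as False.
result = df["a"].str.contains(pattern, case=False, na=False)

print(result.tolist())    # [False, True, False, False]
print(df["a"].iloc[1])    # APP; Web;  (original column left untouched)
```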
So many solutions. Which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller DataFrames:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
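The harness used to time these functions is not shown, so here is one way they could be compared with the standard timeit module. This is only a sketch: the frame size, the repeat count, and the choice of two candidates are arbitrary, and actual timings will depend on the data and the pandas version.

```python
import timeit
import pandas as pd

values = ["app", "web"]
pattern = "".join(f"(?=.*{value})" for value in values)

def contains(df):
    return df.a.str.contains(pattern)

def findall_map(df):
    return df.a.str.findall("|".join(values)).map(lambda x: len(set(x)) == len(values))

# Repeat the question's four rows to get a larger frame (size is arbitrary).
base = pd.DataFrame({"a": ["app;", "app; web;", "web;", ""]})
big = pd.concat([base] * 1000, ignore_index=True)

# Time each candidate on the same frame; total seconds over 5 runs.
timings = {}
for func in (contains, findall_map):
    timings[func.__name__] = timeit.timeit(lambda: func(big), number=5)

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f}s")
```

Swapping other sizes into the `[base] * 1000` factor shows how the ranking changes between small and large frames.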