Check if a column of a pandas dataframe contains a substring for each row of a different column?
I have been stuck for quite a while on what I originally considered a simple task. Here I will use sample data, as the actual data is much messier (and confidential). Essentially I have two columns, both containing strings. For each row of column 'substring', I want to check whether it is a substring of any of the rows of column 'string':
import pandas as pd
s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})
>>> test
string substring
0 exit good
1 hello you how
2 where are you hello
3 goodbye start
Essentially I would like an indicator for each row showing whether its 'substring' value is a substring of any value in the 'string' column:
>>>test
string substring C
0 exit good True
1 hello you how False
2 where are you hello True
3 goodbye start False
I seem to have tried many things and have just become lost.
I have tried iterating over the rows:
sub_test = pd.DataFrame(columns=test.columns)
for index, row in test.iterrows():
    a = row['substring']
    delta = test[test['string'].str.contains(a)]
    if len(delta.index > 1):
        sub_test = pd.concat([sub_test, delta])
This gets me part of the way and returns:
>>>sub_test
string substring
3 goodbye start
1 hello you how
I would think there is a way of doing this using a lambda, but I have not been successful:
test['C'] = test.apply(lambda row: row['substring'] in policies['substring'], axis = 1)
Any help would be appreciated. Thanks
Form one big pattern that we use to extract all substrings. Then we use an isin check to see if the substring matched anywhere.
p = '('+'|'.join(test.substring)+')'
test['C'] = test['substring'].isin(test['string'].str.extractall(p)[0].unique())
substring string C
0 good exit True
1 how hello you False
2 hello where are you True
3 start goodbye False
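If the values in 'substring' can contain regex metacharacters (for example '.', '+' or '('), the joined pattern above would treat them as regex syntax. A minimal sketch of the same approach with each value escaped first, assuming the sample data from the question (the name matched is only illustrative):

```python
import re

import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})

# Escape each value so that regex metacharacters are matched literally.
p = '(' + '|'.join(re.escape(s) for s in test['substring']) + ')'

# Collect every value that matched somewhere in the 'string' column.
matched = test['string'].str.extractall(p)[0].unique()

# A row is True when its 'substring' value is among the matched values.
test['C'] = test['substring'].isin(matched)
```

For the sample data this should mark 'good' and 'hello' as True and the other two rows as False, the same as the unescaped version.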
This works because str.extractall returns a DataFrame with the matches.
test['string'].str.extractall(p)
0
match
1 0 hello
3 0 good
The index is related to test's index (not important here), with another level indicating the match number (since we use .extractall). The value is the substring that was matched. Since our capturing group contains specific words (not a general pattern), we can use an equality check (isin) to get the mask for the 'substring' values.
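One caveat with the alternation pattern: regex matches do not overlap, so if one value is a prefix of another (say 'good' and 'goodbye'), only the alternative listed first is extracted and the other is never reported. As a cross-check, here is a rough sketch of a plain-Python alternative that tests every 'substring' value against every 'string' value directly. It is not the method from the answer above, it assumes the sample data from the question, and it does a quadratic number of comparisons, so it may be slow on large frames:

```python
import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})

# Pull the candidate strings out once so the lambda does not re-read the column.
strings = test['string'].tolist()

# True when the value occurs inside at least one entry of the 'string' column.
test['C'] = test['substring'].apply(lambda sub: any(sub in s for s in strings))
```

Because this uses Python's in operator rather than a regex, no escaping is needed and overlapping values are each checked on their own.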