Check if one column of a pandas DataFrame contains a substring from each row of a different column?

Question

I have been stuck for quite a while on what I originally considered a simple task. Here I will use sample data, as the actual data is much messier (and confidential). Essentially, I have two columns, both containing strings. For each row of column 'substring', I want to check whether it is a substring of any of the rows of column 'string':

import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})
>>> test

    string           substring
0   exit             good
1   hello you        how
2   where are you    hello
3   goodbye          start

Essentially, I would like an indicator for each row showing whether its 'substring' value is a substring of any value in column 'string':

>>>test
    string           substring   C
0   exit             good        True
1   hello you        how         False
2   where are you    hello       True
3   goodbye          start       False

I seem to have tried many things and have just become lost.

I have tried iterating over the rows:

sub_test = pd.DataFrame(columns=test.columns)

for index, row in test.iterrows():
    a = row['substring']
    # rows whose 'string' value contains the current substring
    delta = test[test['string'].str.contains(a)]
    if len(delta.index) > 0:
        sub_test = pd.concat([sub_test, delta])

This gets me some of the way and returns:

>>>sub_test

    string      substring
3   goodbye     start
1   hello you   how

I would think there is a way of doing this using a lambda, but I have not been successful:

test['C'] = test.apply(lambda row: row['substring'] in policies['substring'], axis = 1)

Any help would be appreciated. Thanks.

Answer 1

Form one big pattern that we use to extract all substrings. Then use an isin check to see whether each substring matched anywhere.

p = '('+'|'.join(test.substring)+')'
test['C'] = test['substring'].isin(test['string'].str.extractall(p)[0].unique())

  substring         string      C
0      good           exit   True
1       how      hello you  False
2     hello  where are you   True
3     start        goodbye  False
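
For reference, here is a self-contained sketch of the same approach that can be run as-is on the question's sample data. The re.escape call is an addition that is not part of the snippet above; it assumes the substrings should be matched literally, which matters only if they contain regex metacharacters such as '.' or '+'.

```python
import re

import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})

# Build one alternation pattern with a single capturing group.
# re.escape is an added precaution (assumption: literal matching is wanted).
p = '(' + '|'.join(re.escape(s) for s in test['substring']) + ')'

# Collect every distinct substring that matched somewhere in 'string'.
matched = test['string'].str.extractall(p)[0].unique()

# A row is flagged when its 'substring' value is among the matched words.
test['C'] = test['substring'].isin(matched)
print(test)
```

On the sample data the escaping changes nothing, since the four words contain no special characters.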

This works because str.extractall returns a DataFrame of matches.

test['string'].str.extractall(p)

             0
  match       
1 0      hello
3 0       good

The index is related to test's index (not important here), with another level indicating the match number (since we use .extractall). The value is the substring that was matched. Since our capturing group contains specific words (not a general pattern), we can use an equality check (isin) to get the mask for the 'substring' values.
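
One caveat with the equality check: regex alternation reports only the first alternative that matches at a given position, and matches do not overlap. If one substring is a prefix of another (say 'good' and 'goodbye'), the longer one may never appear in the extracted values even though it occurs in the text. The sketch below is an alternative that is not from the answer above: it uses Python's plain in operator instead of a regex. The helper name contains_any is hypothetical, and the code assumes neither column contains missing values.

```python
import pandas as pd


def contains_any(substrings, strings):
    """Return, for each substring, whether it occurs in any of the strings."""
    strings = list(strings)
    return [any(sub in s for s in strings) for sub in substrings]


s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})
test['C'] = contains_any(test['substring'], test['string'])

# Overlap case: 'good' is a prefix of 'goodbye', and both occur in 'goodbye'.
overlap = pd.DataFrame({'substring': ['good', 'goodbye'],
                        'string': ['goodbye', 'exit']})
overlap['C'] = contains_any(overlap['substring'], overlap['string'])

print(test)
print(overlap)
```

This compares every substring against every string, so the cost grows with the product of the two column lengths; for large frames the single-pattern approach may be faster.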

Answer 1 dated 2020-01-14 16:32:14
