Check if a column of a pandas dataframe contains a substring for each row of a different column?

I have been stuck for quite a while on what I originally thought was a simple task. I will use sample data here, as the actual data is much messier (and confidential). Essentially, I have two columns that both contain strings. For each row of the column 'substring', I want to check whether it is a substring of any row of the column 'string':

import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})
>>> test

    string           substring
0   exit             good
1   hello you        how
2   where are you    hello
3   goodbye          start

Essentially, I would like an indicator for each row showing whether its 'substring' value is a substring of any value in the 'string' column:

>>>test
    string           substring   C
0   exit             good        True
1   hello you        how         False
2   where are you    hello       True
3   goodbye          start       False

I seem to have tried many things, and I have just become lost.

I have tried iterating over the rows:

sub_test = pd.DataFrame(columns=test.columns)

for index, row in test.iterrows():
    a = row['substring']
    # Rows of test whose 'string' value contains this substring
    delta = test[test['string'].str.contains(a)]
    if len(delta.index) > 0:
        sub_test = pd.concat([sub_test, delta])

This gets me part of the way and returns:

>>>sub_test

    string      substring
3   goodbye     start
1   hello you   how

I would think there is a way to do this with a lambda, but I have not been successful:

test['C'] = test.apply(lambda row: row['substring'] in policies['substring'], axis = 1)

Any help would be appreciated. Thanks.

Form one big pattern from all the substrings and use it to extract every match from the 'string' column. Then use an isin check to see whether each substring matched anywhere.

p = '('+'|'.join(test.substring)+')'
test['C'] = test['substring'].isin(test['string'].str.extractall(p)[0].unique())

  substring         string      C
0      good           exit   True
1       how      hello you  False
2     hello  where are you   True
3     start        goodbye  False
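
One caveat with the pattern above: the substrings are joined into a regular expression as-is, so a value containing a character such as '.', '+' or '(' would be interpreted as regex syntax. A minimal sketch of the same approach with each substring escaped first, assuming the sample data from the question (the name `matched` is introduced here only for illustration):

```python
import re

import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})

# Escape each substring so regex metacharacters are matched literally.
p = '(' + '|'.join(re.escape(s) for s in test['substring']) + ')'

# Every substring that matched somewhere in the 'string' column.
matched = test['string'].str.extractall(p)[0].unique()

# A row is True when its own substring is among the matched values.
test['C'] = test['substring'].isin(matched)
```

For plain alphabetic substrings like the sample data, re.escape leaves the values unchanged, so the result is the same as before.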

This works because str.extractall returns a DataFrame of the matches.

test['string'].str.extractall(p)

             0
  match       
1 0      hello
3 0       good

The first index level comes from test's index (not important here), and the second level, match, numbers the matches within each row (since we use .extractall). The value is the substring that was matched. Since our capturing group contains specific words (not a general pattern), we can use an equality check (isin) to get the mask for the 'substring' values.
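
Note that a regex alternation reports non-overlapping matches only, so if two substrings overlap within the same string (for example 'good' and 'goodbye'), one of them may not be reported. As an alternative that avoids regular expressions entirely, here is a sketch using Python's plain `in` operator. This is a different technique from the extractall approach above, not a variant of it, and the name `strings` is introduced only for illustration:

```python
import pandas as pd

s1 = ['good', 'how', 'hello', 'start']
s2 = ['exit', 'hello you', 'where are you', 'goodbye']
test = pd.DataFrame({'substring': s1, 'string': s2})

# Plain Python list of the values to search in.
strings = test['string'].tolist()

# For each substring, test it literally against every value of 'string'.
test['C'] = [any(sub in s for s in strings) for sub in test['substring']]
```

This compares every substring with every string, so the work grows with the product of the two column lengths. That is fine for small frames, but the single-pattern approach is likely to scale better on large ones.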

