Exact word match and display in columns

I have the following dataframe (df):

   Comments                       ID
0        10         Looking for help
1        11  Look at him but be nice
2        12                  Be calm
3        13               Being good
4        14              Him and Her
5        15                  Himself

and a list of words that I need to search for as EXACT matches:

word_list = ['look','be','him']

This is my desired output:

   Comments                       ID Word_01 Word_02 Word_03
0        10         Looking for help                        
1        11  Look at him but be nice    look     be      him
2        12                  Be calm    be                
3        13               Being good                        
4        14              Him and Her    him                
5        15                  Himself  

I've tried a few things, like str.findall:

str.findall(r"\b" + '|'.join(word_list) + r"\b",flags = re.I)

and a few others, but I can't seem to get EXACT matches for my words.

Any help to solve this would be greatly appreciated.

Thanks

You can use pandas' apply function. Example:

import pandas as pd

my_dataframe = pd.DataFrame({'Comments': [10, 11, 12, 13, 14, 15],
                             'ID': [
                                 'Looking for help',
                                 'Look at him but be nice',
                                 'Be calm',
                                 'Being good',
                                 'Him and Her',
                                 'Himself']
                             })

print(my_dataframe)

word_list = ['look','be','him']


for index, word in enumerate(word_list):
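    def match_word(val):
        """
        Return `word` if it occurs anywhere in `val` (case-insensitive), else None.

        This is a plain substring test, not a whole-word match.

        :param val: text of one row of the 'ID' column
        :type val: str
        :return: the matched word, or None
        :rtype: str or None
        """
        if word.lower() in val.lower():
            return word
        return None
    my_dataframe['Word_{}'.format(index)] = my_dataframe['ID'].apply(match_word)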

print(my_dataframe)

Outputs:

   Comments                       ID
0        10         Looking for help
1        11  Look at him but be nice
2        12                  Be calm
3        13               Being good
4        14              Him and Her
5        15                  Himself

   Comments                       ID Word_0 Word_1 Word_2
0        10         Looking for help   look   None   None
1        11  Look at him but be nice   look     be    him
2        12                  Be calm   None     be   None
3        13               Being good   None     be   None
4        14              Him and Her   None   None    him
5        15                  Himself   None   None    him
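
Note that the `in` test in this answer matches substrings, which is why 'Looking', 'Being' and 'Himself' are flagged in the output. Here is a minimal sketch of the same apply-based approach using a word-boundary regex instead. It assumes the same dataframe layout; the helper name make_matcher and the use of re.escape are additions for illustration, not part of the original answer.

```python
import re

import pandas as pd

my_dataframe = pd.DataFrame({
    'Comments': [10, 11, 12, 13, 14, 15],
    'ID': ['Looking for help', 'Look at him but be nice', 'Be calm',
           'Being good', 'Him and Her', 'Himself'],
})
word_list = ['look', 'be', 'him']


def make_matcher(word):
    """Build a function that returns `word` on a whole-word, case-insensitive match."""
    # \b on both sides keeps 'look' from matching inside 'Looking'.
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), flags=re.I)

    def match_word(val):
        return word if pattern.search(val) else None

    return match_word


for index, word in enumerate(word_list):
    my_dataframe['Word_{}'.format(index)] = my_dataframe['ID'].apply(make_matcher(word))

print(my_dataframe)
```

With this variant only rows 1, 2 and 4 get any matches. The columns are still aligned to the position of each word in word_list rather than packed to the left.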

You need word boundaries for each word. One possible solution uses Series.str.extractall, DataFrame.add_prefix and DataFrame.join to the original DataFrame:

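import re

word_list = ['look','be','him']

# Wrap every word in its own pair of \b word boundaries
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
# One row per match, unstacked to one column per match: Word_0, Word_1, ...
df1 = df['ID'].str.extractall('(' + pat + ')', flags = re.I)[0].unstack().add_prefix('Word_')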
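
To see why the boundaries have to wrap each word, it may help to compare the pattern from the question with the per-word pattern. In a regex, `|` has the lowest precedence, so `\blook|be|him\b` reads as three alternatives: `\blook`, `be` and `him\b`. The sketch below uses plain re.findall on a few sample strings; the variable names loose, strict and grouped are only illustrative.

```python
import re

word_list = ['look', 'be', 'him']

# Boundaries only around the whole alternation: \blook | be | him\b
loose = r"\b" + '|'.join(word_list) + r"\b"
# Boundaries around every word: \blook\b | \bbe\b | \bhim\b
strict = '|'.join(r"\b{}\b".format(x) for x in word_list)
# Equivalent alternative: group the alternation, then apply the boundaries once
grouped = r"\b(?:" + '|'.join(word_list) + r")\b"

for text in ['Looking for help', 'Being good', 'Be calm']:
    print(text,
          re.findall(loose, text, flags=re.I),
          re.findall(strict, text, flags=re.I),
          re.findall(grouped, text, flags=re.I))
```

The loose pattern picks up 'Look' from 'Looking' and 'Be' from 'Being', while the strict and grouped patterns match only the standalone 'Be' in 'Be calm'.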

For lowercase values in the output, add Series.str.lower:

df1 = (df['ID'].str.lower()
               .str.extractall('(' + pat + ')')[0]
               .unstack()
               .add_prefix('Word_'))

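Then join to the original DataFrame (the output below comes from the first, case-preserving version):

df = df.join(df1).fillna('')
print(df)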
   Comments                       ID Word_0 Word_1 Word_2
0        10         Looking for help                     
1        11  Look at him but be nice   Look    him     be
2        12                  Be calm     Be              
3        13               Being good                     
4        14              Him and Her    Him              
5        15                  Himself              

Your own solution can be fixed with the same pattern; then convert the values to lists and join to the original:

pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame(df['ID']
        .str.findall(pat, flags = re.I).values.tolist())
        .add_prefix('Word_')
        .fillna(''))   

Or use a list comprehension (should be fastest):

df1 = (pd.DataFrame([re.findall(pat, x, flags = re.I) for x in df['ID']])
       .add_prefix('Word_')
       .fillna(''))

For lowercase, add .lower():

pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame([re.findall(pat, x.lower(), flags = re.I) for x in df['ID']])
           .add_prefix('Word_')
           .fillna(''))
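
For reference, the list-comprehension variant can be run end to end roughly as follows. This is a sketch under a few assumptions: the dataframe is rebuilt from the question's sample data, re.escape is added to guard against regex metacharacters in the words, and index=df.index is passed so the join still lines up if df does not use the default index. None of these three details appear in the answer above.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Comments': [10, 11, 12, 13, 14, 15],
    'ID': ['Looking for help', 'Look at him but be nice', 'Be calm',
           'Being good', 'Him and Her', 'Himself'],
})
word_list = ['look', 'be', 'him']

# Each word gets its own \b...\b; re.escape is an added safety measure.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in word_list)

# One list of lowercase matches per row, in order of appearance in the text.
matches = [re.findall(pat, x.lower()) for x in df['ID']]

# Rows with fewer matches are padded with missing values, then blanked out.
df1 = (pd.DataFrame(matches, index=df.index)
       .add_prefix('Word_')
       .fillna(''))

result = df.join(df1)
print(result)
```

Matches are packed to the left in order of appearance, so row 1 yields look, him, be rather than following the order of word_list.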
