Exact word match and display in columns
I have the following dataframe (df):
   Comments  ID
0  10        Looking for help
1  11        Look at him but be nice
2  12        Be calm
3  13        Being good
4  14        Him and Her
5  15        Himself
and some words in a list which I need to search for an EXACT match:
word_list = ['look','be','him']
This is my desired output:
   Comments  ID                       Word_01  Word_02  Word_03
0  10        Looking for help
1  11        Look at him but be nice  look     be       him
2  12        Be calm                           be
3  13        Being good
4  14        Him and Her                                him
5  15        Himself
I've tried a few things like str.findall:
str.findall(r"\b" + '|'.join(word_list) + r"\b",flags = re.I)
and a few others, but I can't seem to get EXACT matches for my words.
Any help to solve this would be greatly appreciated.
Thanks
You can use pandas' apply function. Example:
import pandas as pd

my_dataframe = pd.DataFrame({'Comments': [10, 11, 12, 13, 14, 15],
                             'ID': [
                                 'Looking for help',
                                 'Look at him but be nice',
                                 'Be calm',
                                 'Being good',
                                 'Him and Her',
                                 'Himself']
                             })
print(my_dataframe)

word_list = ['look', 'be', 'him']

for index, word in enumerate(word_list):
    def match_word(val):
        """
        Under-optimized pattern matching: a case-insensitive substring
        test, so 'look' also matches 'Looking'.
        """
        if word.lower() in val.lower():
            return word
        return None

    # One new column per word: Word_0, Word_1, Word_2
    my_dataframe['Word_{}'.format(index)] = my_dataframe['ID'].apply(match_word)

print(my_dataframe)
Outputs:
   Comments  ID
0  10        Looking for help
1  11        Look at him but be nice
2  12        Be calm
3  13        Being good
4  14        Him and Her
5  15        Himself
   Comments  ID                       Word_0  Word_1  Word_2
0  10        Looking for help         look    None    None
1  11        Look at him but be nice  look    be      him
2  12        Be calm                  None    be      None
3  13        Being good               None    be      None
4  14        Him and Her              None    None    him
5  15        Himself                  None    None    him
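As the output shows, the substring test also reports 'look' for 'Looking', 'be' for 'Being' and 'him' for 'Himself', which is not the exact match the question asks for. A minimal sketch of a whole-word check that could replace the `in` test, assuming a two-argument helper (this `match_word(word, text)` signature and the use of `re.escape` are illustrative choices, not part of the answer above):

```python
import re

def match_word(word, text):
    """Return `word` if it occurs in `text` as a whole word, else None.

    Case-insensitive; \b anchors both ends of the word, so 'look'
    is not found inside 'Looking'.
    """
    pattern = r"\b{}\b".format(re.escape(word))
    return word if re.search(pattern, text, flags=re.IGNORECASE) else None

print(match_word('look', 'Looking for help'))         # None
print(match_word('be', 'Look at him but be nice'))    # be
print(match_word('him', 'Himself'))                   # None
```

Inside the loop above it could be applied as my_dataframe['ID'].apply(lambda val: match_word(word, val)).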
You need word boundaries for each word. One possible solution uses Series.str.extractall, DataFrame.add_prefix and DataFrame.join to the original DataFrame:
import re

word_list = ['look','be','him']
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = df['ID'].str.extractall('(' + pat + ')', flags = re.I)[0].unstack().add_prefix('Word_')
For lowercase data in the output, add Series.str.lower:
df1 = (df['ID'].str.lower()
               .str.extractall('(' + pat + ')')[0]
               .unstack()
               .add_prefix('Word_'))

df = df.join(df1).fillna('')
print(df)
   Comments  ID                       Word_0  Word_1  Word_2
0  10        Looking for help
1  11        Look at him but be nice  Look    him     be
2  12        Be calm                  Be
3  13        Being good
4  14        Him and Her              Him
5  15        Himself
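To see why the pattern from the question does not give exact matches: in a regex, `|` binds more loosely than `\b`, so `r"\b" + '|'.join(word_list) + r"\b"` is read as `\blook`, `be`, or `him\b`. The sketch below compares it with a pattern that has boundaries around every word. The names `loose` and `strict` are illustrative, and the non-capturing group is an equivalent way to write the per-word pattern used in this answer, not a different method:

```python
import re

word_list = ['look', 'be', 'him']

# Pattern from the question: \b applies only to the first and last
# alternative, i.e. it reads as  \blook | be | him\b
loose = r"\b" + '|'.join(word_list) + r"\b"

# Boundaries around every word, written once with a non-capturing group
strict = r"\b(?:" + '|'.join(re.escape(w) for w in word_list) + r")\b"

print(re.findall(loose, 'Looking for help', flags=re.I))          # ['Look']
print(re.findall(loose, 'Being good', flags=re.I))                # ['Be']
print(re.findall(strict, 'Looking for help', flags=re.I))         # []
print(re.findall(strict, 'Being good', flags=re.I))               # []
print(re.findall(strict, 'Look at him but be nice', flags=re.I))  # ['Look', 'him', 'be']
```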
Your solution should be changed to use the same pattern, then convert the values to lists and join to the original:
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame(df['ID']
                    .str.findall(pat, flags = re.I).values.tolist())
         .add_prefix('Word_')
         .fillna(''))
Or use a list comprehension (should be fastest):
df1 = (pd.DataFrame([re.findall(pat, x, flags = re.I) for x in df['ID']])
         .add_prefix('Word_')
         .fillna(''))
For lowercase output, add .lower():
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame([re.findall(pat, x.lower(), flags = re.I) for x in df['ID']])
         .add_prefix('Word_')
         .fillna(''))
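Note that the findall and extractall variants number the columns by order of occurrence in each comment (Look, him, be), whereas the desired output in the question lists look, be, him in word_list order, which suggests one column per search word. A minimal sketch for that layout, assuming the words in word_list are already lowercase and reusing the Word_01 style names from the question (the `hits` variable is illustrative):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Comments': [10, 11, 12, 13, 14, 15],
    'ID': ['Looking for help', 'Look at him but be nice', 'Be calm',
           'Being good', 'Him and Her', 'Himself'],
})
word_list = ['look', 'be', 'him']  # assumed to be lowercase

# One regex with a word boundary around every search word
pat = '|'.join(r"\b{}\b".format(re.escape(w)) for w in word_list)

# Set of lower-cased whole-word hits for each row
hits = [set(re.findall(pat, text.lower())) for text in df['ID']]

# Word_01, Word_02, ... follow word_list order, one column per word
for i, word in enumerate(word_list, start=1):
    df['Word_{:02d}'.format(i)] = [word if word in found else '' for found in hits]

print(df)
```

Each row is scanned once; a cell holds the search word if it was found as a whole word and an empty string otherwise.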