Python - Find matching string(s) between DataFrame column (scraped text) and list of strings
I am having a hard time comparing the strings from a DataFrame column with a list of strings.
Let me explain: I collected data from social media for a personal project, and alongside that I created a list of strings like the following:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
There are other words, but this is just to give you an idea.
My goal is to compare EACH of the words in this list with 2 existing DF columns, which contain the titles and the post messages (from reddit). To be clear, I want to create a new column that displays the words from my list that also appear in the columns containing the posts.
So far, this is what I have done:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]
df
>>Output:
   title_lemmatized  text_lemmatized          matched_word(s)
0  Title1            'claim thorough vet...'  'ai'
1  Title@            'Yeaaah today iota...'   'IoT'
Here is the output result for more details.
The issue: The main problem is that it matches letters inside other words (not actual whole words) from the list.
Example:
--> the_list contains 'ai' (for artificial intelligence) or 'IoT' (for Internet of Things)
--> If df['text_lemmatized'] has the word 'claim' in the text, then 'ai' is returned as the match. Likewise, 'Iota' matches 'IoT'.
What I wish:
   title_lemmatized  text_lemmatized                     matched_word(s)
0  Title1            'AI claim that Iot devises...'      'AI', 'IoT'
1  Title2            'The claim story about...'
2  Title3            'augmented reality and ai are...'   'augmented reality', 'ai'
3  Title4            'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'
Thanks a lot :)
You have to add word boundaries '\b' to your regex pattern. From the re module docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
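To make the effect of the word boundary concrete, here is a minimal sketch using only the standard re module. The sample sentence and the two-keyword pattern are made up for illustration and are not taken from the asker's data.

```python
import re

# Hypothetical sample text, loosely modelled on the question's example.
text = "claim thorough vet of the ai approach"

# Without word boundaries, 'ai' is also found inside 'claim'.
loose = re.findall(r"(ai|iot)", text, flags=re.IGNORECASE)

# With \b on both sides, only the standalone word 'ai' matches.
strict = re.findall(r"\b(ai|iot)\b", text, flags=re.IGNORECASE)

print(loose)   # ['ai', 'ai']
print(strict)  # ['ai']
```

The first result contains two hits because the substring inside 'claim' counts; the second keeps only the real word.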
Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches, because Series.str.extract only returns the first match in each string.
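The difference between the three methods can be seen on a tiny Series. This is a sketch on invented data; the two-keyword pattern is an assumption chosen to keep the output short.

```python
import re
import pandas as pd

# Hypothetical two-row Series for illustration.
s = pd.Series(["AI and IoT devices", "no keyword here"])
pat = r"\b(AI|IoT)\b"

# extract: one column per capture group, first match only, NaN if none.
first_only = s.str.extract(pat, flags=re.IGNORECASE)

# findall: one list per row holding every match (empty list if none).
all_matches = s.str.findall(pat, flags=re.IGNORECASE)

# extractall: one row per match, indexed by (original row, match number).
every_match = s.str.extractall(pat, flags=re.IGNORECASE)

print(first_only)
print(all_matches)
print(every_match)
```

findall is usually the more convenient choice when the goal is a single new column, since it keeps one entry per original row.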
This should work:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
import re
pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)
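For a self-contained version that can be run without the original data, the sketch below applies the same idea to a small made-up DataFrame. The shortened keyword list and the sample rows are assumptions modelled on the desired output in the question. Two optional safeguards are added that are not part of the answer above: re.escape, in case a keyword contains regex metacharacters, and sorting the alternatives longest first.

```python
import re
import pandas as pd

# Hypothetical, shortened keyword list and sample posts for illustration.
keywords = ["AI", "IoT", "augmented reality", "artificial intelligence"]
df = pd.DataFrame({
    "text_lemmatized": [
        "AI claim that Iot devises",
        "The claim story about Iota",
        "augmented reality and ai are",
        "AI ai or artificial intelligence",
    ]
})

# Escape each keyword and put longer alternatives first (optional safeguards).
alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)

# Non-capturing group wrapped in word boundaries: whole words/phrases only.
pat = r"\b(?:{0})\b".format("|".join(alternatives))

# findall gives a list of matches per row; join turns it into one string.
df["matched text"] = (
    df["text_lemmatized"].str.findall(pat, flags=re.IGNORECASE).map(", ".join)
)

print(df)
```

Rows without any match end up with an empty string rather than NaN, so a filter such as df[df["matched text"] != ""] would be needed in place of the pd.isna check from the question.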