Python - Find matching strings between a DataFrame column (scraped text) and a list of strings

Question

I am having a hard time comparing the strings in a DataFrame column with a list of strings.

Let me explain: I collected data from social media for a personal project, and alongside that I created a list of strings like the following:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

There are other words, but this is just to give you an idea.

My goal is to compare EACH word in this list with two existing DataFrame columns that contain titles and post messages (from Reddit). To be clear, I want to create a new column that shows which words from my list appear in the columns containing the posts.

So far, this is what I have done:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]

df

>>Output:

      title_lemmatized   text_lemmatized        matched_word(s)
0         Title1       'claim thorough vet...'      'ai'
1         Title@       'Yeaaah today iota...'       'IoT'

Here is the output result for more details.

The issue: The main problem is that it returns sequences of letters found inside other words (not whole words) that match items in the list.

Example:

--> the_list contains 'ai' (for artificial intelligence) or 'IoT' (for Internet of Things)

--> If df['text_lemmatized'] has the word 'claim' in the text, then 'ai' will be the match. Likewise, 'Iota' will match 'IoT'.

What I wish:

   title_lemmatized       text_lemmatized             matched_word(s)
0    Title1         'AI claim that Iot devises...'      'AI', 'IoT'
1    Title2         'The claim story about...'
2    Title3         'augmented reality and ai are...'   'augmented reality', 'ai'
3    Title4         'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'

Thanks a lot :)

Answer 1

You have to add word boundaries '\b' to your regex pattern. From the re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches, because Series.str.extract only returns the first match in each row.

This should work:
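To make the effect of the boundary concrete, here is a minimal sketch using only the re module. The two keywords and the sample sentences are assumptions picked to mirror the examples in the question, not data from the actual DataFrame.

```python
import re

# Hypothetical keywords and texts, chosen to mirror the question's examples.
keywords = ["ai", "IoT"]
alternation = "|".join(keywords)

# Without boundaries, the pattern matches anywhere, even inside other words.
pattern_loose = re.compile(alternation, flags=re.IGNORECASE)
# With \b on both sides, a keyword only matches as a whole word.
pattern_bounded = re.compile(r"\b(?:{})\b".format(alternation), flags=re.IGNORECASE)

loose_hits = pattern_loose.findall("claim thorough vet")       # finds 'ai' inside 'claim'
bounded_hits = pattern_bounded.findall("claim thorough vet")   # finds nothing
real_hits = pattern_bounded.findall("AI claim that IoT devices")

print(loose_hits, bounded_hits, real_hits)
```

The non-capturing group (?:...) is only there so the boundaries apply to the whole alternation rather than to the first and last keyword alone.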
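The difference between the three methods can be sketched on a tiny Series. The sample texts and the two-keyword pattern below are made up for illustration; only the method names come from pandas.

```python
import re
import pandas as pd

# Hypothetical sample texts; the second row contains no keyword at all.
texts = pd.Series(["AI ai or NLP", "nothing relevant here"])
pat = r"\b(AI|NLP)\b"

# extract: one row per input row, holding only the FIRST match (NaN if none).
first_only = texts.str.extract(pat, flags=re.IGNORECASE)

# findall: one list per input row, holding every match (empty list if none).
all_as_lists = texts.str.findall(pat, flags=re.IGNORECASE)

# extractall: one row per match, indexed by (original row, match number).
all_as_rows = texts.str.extractall(pat, flags=re.IGNORECASE)

print(first_only)
print(all_as_lists)
print(all_as_rows)
```

Since the goal here is a single new column aligned with the original rows, findall is probably the more convenient of the two, while extractall suits a long format with one match per row.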

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

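import re

# Wrap the alternation in word boundaries so that only whole words/phrases match
pat = r'\b({0})\b'.format('|'.join(the_list))
# findall returns a list of every match per row; join each list into one string
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)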
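If the real list or data is messier than the sample, a slightly more defensive variant may help. The sketch below is an assumption-laden illustration, not part of the approach above: the DataFrame contents, the shortened keyword list (including the made-up 'AR' entry), the helper name find_terms, and the 'matched title' column are all hypothetical. It escapes each term with re.escape, puts longer terms first so that a term is not shadowed by a shorter one sharing its prefix, and replaces missing values so that every row yields a list.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    "title_lemmatized": ["AI news", "Title2", "Title3"],
    "text_lemmatized": ["AI claim that IoT devices", "AR Cloud and ai", None],
})
the_list = ["AI", "IoT", "AR", "AR Cloud"]  # shortened, partly made-up list

# Longest terms first: 'AR Cloud' is tried before 'AR' at the same position.
terms = sorted(the_list, key=len, reverse=True)
# re.escape guards against terms that contain regex metacharacters.
pat = r"\b(?:{})\b".format("|".join(re.escape(t) for t in terms))

def find_terms(column):
    """Return a comma-separated string of matched terms for each row."""
    # Missing values become empty strings so findall yields a list everywhere.
    return column.fillna("").str.findall(pat, flags=re.IGNORECASE).map(", ".join)

# The same helper can be applied to both the post text and the title.
df["matched text"] = find_terms(df["text_lemmatized"])
df["matched title"] = find_terms(df["title_lemmatized"])

print(df[["matched title", "matched text"]])
```

Rows without any match end up with an empty string, which matches the blank cell shown for Title2 in the desired output of the question.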
