Python - Find matching string(s) between DataFrame column (scraped text) and list of strings

I am having a hard time comparing the strings in a DataFrame column with a list of strings.

Let me explain: I collected data from social media for a personal project, and separately I created a list of strings like the following:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

There are other words, but this is just to give you an idea.

My goal is to compare EACH word in this list with 2 existing DataFrame columns, which contain the titles and the post messages (from reddit). To be clear, I want to create a new column that shows which words from my list appear in the columns containing the posts.

So far, this is what I have done:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

import re
import pandas as pd

df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags=re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]

df

>>Output:

      title_lemmatized   text_lemmatized        matched_word(s)
0         Title1       'claim thorough vet...'      'ai'
1         Title2       'Yeaaah today iota...'       'IoT'

Here is the output result, for more detail.

The issue: The main problem is that it returns sequences of letters found inside other words, not whole words that match the list.

Example:

--> the_list contains 'ai' (for artificial intelligence) or 'IoT' (for Internet of Things)

--> df['text_lemmatized'] has the word 'claim' in the text, so 'ai' is reported as a match. Likewise, 'Iota' matches 'IoT'.
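The behaviour described above can be reproduced with the plain re module, without any DataFrame. This is only a minimal sketch: `terms` is a hypothetical two-item stand-in for the full list, and the sample strings are taken from the output shown earlier.

```python
import re

# Hypothetical two-term list standing in for the full the_list.
terms = ['ai', 'IoT']
pattern = '({0})'.format('|'.join(terms))

# No word boundaries: the pattern is free to match inside longer words.
m1 = re.search(pattern, 'claim thorough vet', flags=re.IGNORECASE)
m2 = re.search(pattern, 'Yeaaah today iota', flags=re.IGNORECASE)

print(m1.group(1))  # 'ai', taken from the middle of 'claim'
print(m2.group(1))  # 'iot', taken from the start of 'iota'
```

Note that the text returned is the substring found in the post, so its casing follows the post rather than the list entry.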

What I want:

   title_lemmatized       text_lemmatized             matched_word(s)
0    Title1         'AI claim that Iot devises...'      'AI', 'IoT'
1    Title2         'The claim story about...'
2    Title3         'augmented reality and ai are...'   'augmented reality', 'ai'
3    Title4         'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'

Thanks a lot :)

You have to add word boundaries (\b) to your regex pattern. From the re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
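To make the quoted definition concrete, here is a small check of \b against a few sample strings. The samples are made up for illustration; the only thing being shown is which of them contain 'ai' as a whole word.

```python
import re

# Whole-word, case-insensitive match for the single term 'ai'.
pat = re.compile(r'\bai\b', flags=re.IGNORECASE)

samples = ['claim', 'AI claim', '(ai)', 'ai.', 'ai3', 'fair', 'ai_lab']
results = {s: bool(pat.search(s)) for s in samples}

for text, matched in results.items():
    print(repr(text), matched)
```

Only 'AI claim', '(ai)' and 'ai.' match. Digits and the underscore count as word characters, so 'ai3' and 'ai_lab' do not match; this matters for list entries such as 'artificial_intelligence', which will only match text that is written with the underscore.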

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract, because extract only returns the first match in each row and you need all of them.
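As a rough illustration of how the three methods differ, the sketch below runs them on a hypothetical two-row Series with a two-term pattern. The variable names are made up for the example.

```python
import re
import pandas as pd

# Hypothetical sample data; the second row contains no list term as a whole word.
s = pd.Series(['AI claim that IoT devices', 'The claim story'])
pat = r'\b(AI|IoT)\b'

# extract: one row per input row, holding only the first match (NaN if none).
first_only = s.str.extract(pat, flags=re.IGNORECASE)

# findall: one list per input row, holding every match (empty list if none).
all_lists = s.str.findall(pat, flags=re.IGNORECASE)

# extractall: one row per match, indexed by (original row, match number).
all_rows = s.str.extractall(pat, flags=re.IGNORECASE)

print(first_only)
print(all_lists)
print(all_rows)
```

findall is usually the more convenient of the two when the goal is a single new column, since its result has the same index as the original column; extractall drops rows without matches and needs a groupby to be folded back.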

This should work:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

import re

# \b on both sides restricts matches to whole words
pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)
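For a self-contained run, here is a sketch that applies the same idea to both the title and the text column, using the rows from the desired output in the question. Several things in it are assumptions rather than part of the answer above: the shortened `the_list`, the helper `find_terms`, the new column names, the use of re.escape (in case a term contains regex metacharacters), sorting terms longest first (so that when one term is a prefix of another, the longer one is tried first), and fillna('') (so missing posts do not break the join).

```python
import re
import pandas as pd

# Shortened, hypothetical stand-in for the_list.
the_list = ['AI', 'IoT', 'augmented reality', 'artificial intelligence']

# Escape each term and put longer terms first in the alternation.
terms = sorted(the_list, key=len, reverse=True)
pat = r'\b(?:{0})\b'.format('|'.join(re.escape(t) for t in terms))

df = pd.DataFrame({
    'title_lemmatized': ['Title1', 'Title2', 'Title3', 'Title4'],
    'text_lemmatized': [
        'AI claim that Iot devises',
        'The claim story about',
        'augmented reality and ai are',
        'AI ai or artificial intelligence',
    ],
})

def find_terms(column):
    """Return a Series of comma-separated whole-word matches for one column."""
    # Missing values become empty strings, which yield an empty match list.
    found = column.fillna('').str.findall(pat, flags=re.IGNORECASE)
    return found.map(', '.join)

df['matched_text'] = find_terms(df['text_lemmatized'])
df['matched_title'] = find_terms(df['title_lemmatized'])

print(df[['text_lemmatized', 'matched_text']])
```

With this data, 'claim' no longer produces a match, and rows without any list term get an empty string. The matches keep the casing found in the post ('Iot', 'ai'), not the casing of the list entry.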
