Sub-string match in word tokenizer

I have defined a function that returns the sentences containing a specified word from an Excel file with a 'text' column. With the help of @Julien Marrec, I redefined the function so that I can pass multiple words as an argument, as below:

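from nltk.tokenize import sent_tokenize, word_tokenize

searched_words = ['word1', 'word2', 'word3']  # ... and so on
# Keep every sentence in which at least one token equals a searched word.
df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                               if any(w.lower() in searched_words
                                      for w in word_tokenize(sent))])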

But the problem is that the dataset is pretty huge (typically several GB) and unstructured. Can someone suggest how I can also get substring matches, i.e. if a sentence contains 'xxxxxword1yyyyy', my function should return that sentence as well?

If you don't care about word boundaries, you can skip word tokenisation and just match with a regular expression.

However, this might give you a lot of matches that you didn't expect. For example, the search terms "tin" and "nation" will both match in the word "procrastination". If that is what you want, you can do the following:

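import re

# Escape each term so it is matched literally; IGNORECASE keeps the search case-insensitive.
fsa = re.compile('|'.join(re.escape(w) for w in searched_words), re.IGNORECASE)
df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                               if fsa.search(sent)])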

The re.compile() call creates a regex pattern object that consists simply of a set of alternatives. This lets you scan through the complete sentence, looking for all of the searched words at the same time.
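To see the alternation at work without NLTK or pandas, here is a minimal sketch that runs on a plain list of sentences. The search terms, the sample sentences, and the helper name matching_sentences are made up for illustration, and the sentences are assumed to be already split:

```python
import re

# Hypothetical search terms and sentences, used only for illustration.
searched_words = ["tin", "nation", "C++"]
sentences = [
    "Procrastination is a habit.",
    "I write C++ at work.",
    "Nothing to see here.",
]

# One alternation of escaped literals. re.escape() keeps characters such as
# '+' from being read as regex operators; IGNORECASE ignores letter case.
pattern = re.compile(
    "|".join(re.escape(w) for w in searched_words),
    re.IGNORECASE,
)


def matching_sentences(sents):
    """Return the sentences that contain any search term as a substring."""
    return [s for s in sents if pattern.search(s)]


hits = matching_sentences(sentences)
print(hits)
```

The first sentence is returned only because "tin" and "nation" occur inside "Procrastination", which is exactly the kind of unexpected match described above. The second is returned because the escaped "C++" is matched literally.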
