Return list of sentences that contain words from a list of words

I have a long (1.5m) list of sentences and a similarly long list of words that I am looking for within the list of sentences. For example:

list_of_words = ['Turin', 'Milan']
list_of_sents = ['This is a sent about turin.', 'This is a sent about manufacturing.']

I would like a function that returns those sentences that contain a target word with no alphanumeric characters adjacent to it. In other words, only the first of the sentences above should match.

I have developed the function below, but it takes too long to work through the millions of words and sentences. I was wondering if there is a package or alternative approach that could mitigate this computational intensity.

import re

def find_target_sents(list_of_words, list_of_sents):
    target_sents = []
    word_len = len(list_of_words)
    sent_len = len(list_of_sents)
    for i, word in enumerate(list_of_words, 1):
        # compile once per word; \b requires a word boundary on each side
        match = re.compile(r'\b%s\b' % re.escape(word), re.I)
        for j, sent in enumerate(list_of_sents, 1):
            print('%s out of %s words and %s out of %s sentences' % (i, word_len, j, sent_len))
            if match.search(sent) is not None:
                print(sent)
                target_sents.append((word, sent))
    print(target_sents)
    return target_sents
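With those fixes, a quick sanity check on the example data (a minimal sketch, using nothing beyond the code above) looks like this:

find_target_sents(['Turin', 'Milan'],
                  ['This is a sent about turin.',
                   'This is a sent about manufacturing.'])
# returns [('Turin', 'This is a sent about turin.')]
# 'manufacturing' contains the letters 'turin' but fails the \b boundary check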

If you build a single string with all the words to search for from list_of_words, like (Turin|Milan), you can do a regex match on:

^.*\b(Turin|Milan)\b.*$
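A minimal sketch of that idea (variable names are illustrative; re.escape guards against words containing regex metacharacters, and re.I mirrors the case-insensitive search in the question's code):

import re

def find_target_sents(list_of_words, list_of_sents):
    # one alternation pattern for all target words, anchored at word boundaries
    pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, list_of_words)), re.I)
    target_sents = []
    for sent in list_of_sents:
        m = pattern.search(sent)
        if m is not None:
            target_sents.append((m.group(1), sent))
    return target_sents

Note that with re.search the leading ^.* and trailing .*$ are unnecessary; the \b anchors already enforce the whole-word check, and a single pass over the sentences replaces the nested loops.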

Also, we could avoid both of the for loops, as mentioned in this answer.

You could just build sets and use their constant-time membership check:

from string import punctuation

def find_target_sents(words, sents):
    # translation table that strips punctuation
    table = str.maketrans('', '', punctuation)
    # hold found sentences by word
    found = {word: [] for word in words}
    # strip punctuation and make a unique set of words for each sentence
    parsed = [set(sent.translate(table).split()) for sent in sents]
    # constant-time membership check of each word against each word set
    for word in found:
        for sent, tokens in zip(sents, parsed):
            if word in tokens:
                found[word].append(sent)
    return found
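As a quick check (a sketch; the capitalization of 'Turin' is aligned here because, unlike re.I in the question's regex, the set lookup is case-sensitive, so you would need to casefold both the words and the tokens to match a lowercase 'turin'):

found = find_target_sents(['Turin', 'Milan'],
                          ['This is a sent about Turin.',
                           'This is a sent about manufacturing.'])
print(found)
# {'Turin': ['This is a sent about Turin.'], 'Milan': []}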

Basically this assumes your sentences follow English grammar and that words are separated by spaces, following a character or punctuation mark (assuming another word will follow, of course).

It takes each sentence, removes any punctuation from it, splits on whitespace, and turns the result into a set, which has a constant-time, O(1), membership check.

So the sentence "I, want to go, to Burger King!" becomes {'I', 'want', 'to', 'go', 'Burger', 'King'}, where only unique elements exist!

Obviously there are issues if you are looking for 'Burger King', but that's technically two words...
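One way to soften that limitation (a sketch that goes beyond the original answer: the token_set helper and the bigram idea are illustrative) is to store adjacent-word pairs alongside the single words, so a two-word target like 'Burger King' can use the same set membership check:

from string import punctuation

def token_set(sent):
    # strip punctuation, then keep single words plus adjacent-word bigrams
    table = str.maketrans('', '', punctuation)
    tokens = sent.translate(table).split()
    bigrams = (' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return set(tokens) | set(bigrams)

print(token_set('I, want to go, to Burger King!'))
# includes 'Burger King' as well as the individual words

This roughly doubles the size of each set but keeps lookups at O(1); for targets longer than two words you would extend the same idea to longer n-grams.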
