How to locate specific sequences of words in a sentence efficiently

The problem is to find a time-efficient function that takes a sentence and a list of word sequences of varying length (also known as ngrams) and returns, for every sequence, a list of indexes indicating where it occurs in the sentence. It should do this as efficiently as possible for large numbers of sequences.

What I ultimately want is to replace each occurrence of an ngram in the sentence with the words of that sequence joined by "_".

For example, if my sequences are ["hello", "world"] and ["my", "problem"], and the sentence is "hello world this is my problem can you solve it please?", the function should return "hello_world this is my_problem can you solve it please?"

What I did is group the sequences by the number of words each one has and save that in a dictionary where the key is the number of words and the value is a list of the sequences of that length.
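The grouping described above might be built roughly as follows. This is only a sketch: `group_by_length` is a hypothetical helper name that does not appear in the original post, and it assumes each sequence is a list of words. The keys are sorted in ascending order so that iterating over them in reverse visits the longest sequences first.

```python
from collections import defaultdict

def group_by_length(sequences):
    """Group word sequences by length (hypothetical helper, not from the original post)."""
    grouped = defaultdict(list)
    for seq in sequences:
        grouped[len(seq)].append(list(seq))
    # Sort keys ascending so that reversing the keys yields the longest sequences first.
    return dict(sorted(grouped.items()))

ngrams = group_by_length([["can", "you", "solve"], ["hello", "world"], ["my", "problem"]])
print(ngrams)
# {2: [['hello', 'world'], ['my', 'problem']], 3: [['can', 'you', 'solve']]}
```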

The variable ngrams is this dictionary:

def replaceNgrams(line, ngrams):
    words = line.split()
    #Iterates backwards in the length of the sequences
    for n in list(ngrams.keys())[::-1]: #O(L*T)
        newWords = []
        if len(words) >= n:
            terms = ngrams[n]
            i = 0
            while i < len(words)+1-n: #O(L*Tn)
                #Gets a sequence of words from the sentence with the same length as the ngrams currently being checked
                nwords = words[i:i+n].copy()
                #Checks if that sequence is in my list of sequences
                if nwords in terms: #O(Tn)
                    newWords.append("_".join(nwords))
                    i+=n
                else:
                    newWords.append(words[i])
                    i+=1
            newWords += words[i:].copy()
            words = newWords.copy()
    return " ".join(words)

This works as desired, but I have too many sequences and too many lines to apply this to, and it is far too slow for me (it would take a month to finish).

I think this can be achieved with basic string operations. I'll first join each of the sequences into a single string and then look for it in full_text. If it is found, I'll keep track of it in output_dict with its start and end index. You can use these indices as you require.


full_text = "hello world this is my problem can you solve it please?"

sequences = [["hello", "world"], ["my", "problem"]]

# Join each word sequence into a single space-separated string.
joined_sequences = [" ".join(sequence) for sequence in sequences]

def find_location(message, seq):
    # str.find returns the index of the first occurrence, or -1 if seq is absent.
    return message.find(seq)

output_dict = {}

for sequence in joined_sequences:
    start_index = find_location(full_text, sequence)
    if start_index > -1:
        # Store the start and end character offsets of the first match.
        output_dict[sequence] = [start_index, start_index + len(sequence)]

print(output_dict)

This will output:

{'hello world': [0, 11], 'my problem': [20, 30]}

Then you can do whatever you want with the start and end indices.
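Note that the lookup above records only the first occurrence of each sequence, whereas the question asks for every position. A minimal sketch of collecting all start offsets with repeated `str.find` calls could look like this; `find_all` is a hypothetical helper name, and the offsets are character positions rather than word positions.

```python
def find_all(message, seq):
    """Return the start offset of every (possibly overlapping) occurrence of seq."""
    positions = []
    start = message.find(seq)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous match.
        start = message.find(seq, start + 1)
    return positions

text = "hello world and hello world again"
print(find_all(text, "hello world"))
# [0, 16]
```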

If you only need to replace the spaces inside each sequence with underscores, you might not even need the indices.

for sequence in joined_sequences:
    if sequence in full_text:
        full_text = full_text.replace(sequence, "_".join(sequence.split()))

print(full_text)

This should give you:

hello_world this is my_problem can you solve it please?
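One caveat with plain `str.replace` is that it matches substrings rather than whole words, so "my problem" would also match inside "tommy problems". A word-level variant that keeps the sequences in a set of tuples, so each membership test takes constant time on average, might look like the sketch below. `replace_ngrams` is a hypothetical name; it assumes whitespace tokenization, tries the longest sequence first at each position (which can differ from the question's pass-per-length order when sequences overlap), and collapses runs of whitespace into single spaces.

```python
def replace_ngrams(line, sequences):
    """Join each occurrence of a known word sequence with underscores (hypothetical helper)."""
    # Tuples are hashable, so membership tests are O(1) on average; skip empty sequences.
    lookup = {tuple(seq) for seq in sequences if seq}
    lengths = sorted({len(seq) for seq in lookup}, reverse=True)
    words = line.split()
    out = []
    i = 0
    while i < len(words):
        for n in lengths:
            candidate = tuple(words[i:i + n])
            if len(candidate) == n and candidate in lookup:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No known sequence starts at position i; keep the word unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)

sentence = "hello world this is my problem can you solve it please?"
print(replace_ngrams(sentence, [["hello", "world"], ["my", "problem"]]))
# hello_world this is my_problem can you solve it please?
```

With this approach the work per line depends on the number of words and the number of distinct sequence lengths, not on the total number of sequences.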
