

How to find which sentences have the most words in common?

Let's say I have a paragraph. I separate it into sentences with sent_tokenize:

variable = ['By the 1870s the scientific community and much of the general public had accepted evolution as a fact.',
    'However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.',
    'Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.']

Now I split each sentence into words and append them to some variable. How can I find the two sentences that have the most words in common? I am not sure how to do this. If I have 10 sentences, then I will have 90 checks (one for each ordered pair of sentences). Thanks.
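(As an aside: since word overlap is symmetric, only unordered pairs actually need checking, which is 45 for 10 sentences rather than 90. A minimal illustration, with purely illustrative names, using itertools.combinations to enumerate exactly those pairs:)

from itertools import combinations

n = 10  # number of sentences
pairs = list(combinations(range(n), 2))  # each unordered pair exactly once
print(len(pairs))  # 45, i.e. n * (n - 1) / 2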

You can use the intersection of Python sets.

If you have three sentences such as:

a = "a b c d"
b = "a c x y"
c = "a q v"

You can check how many of the same words occur in two sentences by doing:

sameWords = set.intersection(set(a.split(" ")), set(c.split(" ")))
numberOfWords = len(sameWords)
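For a and c above, sameWords comes out as {'a'} (the only shared word), so numberOfWords is 1. Note that set.intersection(x, y) can also be written with the infix operator as x & y.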

With this you can iterate over your list of sentences and find the two with the most words in common. This gives us:

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]

def similar(s1, s2):
    sameWords = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    return len(sameWords)

currentSimilar = 0
s1 = ""
s2 = ""

for sentence in sentences:
    for sentence2 in sentences:
        if sentence is sentence2:
            continue
        similiarity = similar(sentence, sentence2)
        if (similiarity > currentSimilar):
            s1 = sentence
            s2 = sentence2
            currentSimilar = similiarity

print(s1, s2)
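For the sample list this prints a b c d a b c d x, since those two sentences share the four words a, b, c and d.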

There might be some dynamic programming solution to this problem if performance is an issue.
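A more compact alternative precomputes each sentence's word set once and uses itertools.combinations to examine each pair exactly once: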

import itertools

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
# Pair each sentence's index with its set of words, stripping punctuation from the ends.
decomposedsentences = ((index, set(sentence.strip(".?!,").split(" "))) for index, sentence in enumerate(sentences))
# Pick the pair whose word sets have the largest intersection.
s1, s2 = max(itertools.combinations(decomposedsentences, 2),
             key=lambda pair: len(pair[0][1] & pair[1][1]))
print("The two sentences with the most common words:", sentences[s1[0]], sentences[s2[0]])
