
How to find which sentences have the most words in common?

Let's say I have a paragraph. I split it into sentences with sent_tokenize:

variable = ['By the 1870s the scientific community and much of the general public had accepted evolution as a fact.',
    'However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.',
    'Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.']

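Now I split each sentence into words and append them to some variable. How can I find the two sentences that have the most words in common? I'm not sure how to go about it. If I have 10 sentences, I will have 90 checks (one between each sentence and every other). Thanks.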
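A side note on the count: the overlap between sentence i and sentence j is the same as between j and i, so each pair only needs to be checked once. A minimal sketch of the arithmetic, where `n_sentences` is just an illustrative name and `math.comb` assumes Python 3.8 or later:

```python
import itertools
import math

n_sentences = 10

# Ordered pairs (i, j) with i != j: the 90 checks mentioned in the question.
ordered_pairs = list(itertools.permutations(range(n_sentences), 2))

# Unordered pairs {i, j}: enough here, because set intersection is symmetric.
unordered_pairs = list(itertools.combinations(range(n_sentences), 2))

print(len(ordered_pairs), len(unordered_pairs))
print(math.comb(n_sentences, 2))  # same value as len(unordered_pairs)
```

So for 10 sentences, half of the 90 comparisons are redundant.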

You can use the intersection of Python sets.

If you have three sentences:

a = "a b c d"
b = "a c x y"
c = "a q v"

You can count how many words appear in both of two sentences like this:

sameWords = set.intersection(set(a.split(" ")), set(c.split(" ")))
numberOfWords = len(sameWords)
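To make the idea concrete for all three example sentences, here is a small sketch that wraps the intersection in a helper. The name `shared_words` is hypothetical, and it uses the `&` operator, which is equivalent to `set.intersection` for two sets:

```python
a = "a b c d"
b = "a c x y"
c = "a q v"

def shared_words(s1, s2):
    """Return the set of words that occur in both sentences."""
    # split() without arguments also tolerates repeated spaces.
    return set(s1.split()) & set(s2.split())

# Print the overlap size for every pair of the three sentences.
for name, (x, y) in {"a/b": (a, b), "a/c": (a, c), "b/c": (b, c)}.items():
    print(name, len(shared_words(x, y)))
```

Here `a` and `b` share two words (`a` and `c`), while each of the other two pairs shares only `a`.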

This way, you can iterate over your list of sentences and find the two sentences that have the most words in common. That gives us:

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]

def similar(s1, s2):
    sameWords = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    return len(sameWords)

currentSimilar = 0
s1 = ""
s2 = ""

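# Compare each pair once, by position, so that two equal sentences are not
# skipped by accident (an identity check with `is` can do that for strings).
for i, sentence in enumerate(sentences):
    for sentence2 in sentences[i + 1:]:
        similarity = similar(sentence, sentence2)
        if similarity > currentSimilar:
            s1 = sentence
            s2 = sentence2
            currentSimilar = similarity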

print(s1, s2)

If performance is an issue, there may be some dynamic programming solution to this problem.
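One possible way to reduce the work, sketched here as an assumption and not as the dynamic programming solution hinted at above, is an inverted index: map each word to the sentences that contain it, then count shared words only for pairs that actually co-occur under some word. The function name `most_similar_pair` is hypothetical. This helps when most sentence pairs share nothing. In the worst case (a word present in every sentence) it is still quadratic.

```python
from collections import Counter, defaultdict
from itertools import combinations

def most_similar_pair(sentences):
    """Return (sentence_a, sentence_b, shared_count), or None if no pair shares a word."""
    # Inverted index: word -> indices of the sentences that contain it.
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].add(i)

    # Every word adds one shared-word point to each pair of sentences containing it.
    overlap = Counter()
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            overlap[pair] += 1

    if not overlap:
        return None
    (i, j), count = overlap.most_common(1)[0]
    return sentences[i], sentences[j], count

print(most_similar_pair(["a b c d", "a d e f", "c x y", "a b c d x"]))
```

Pairs with no word in common never appear in `overlap`, so they cost nothing beyond building the index.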

import itertools

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
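# Pair each sentence's index with its set of words (punctuation trimmed from the sentence ends).
decomposedsentences = ((index, set(sentence.strip(".?!,").split(" "))) for index, sentence in enumerate(sentences))
# Pick the pair of (index, word set) tuples whose word sets overlap the most.
s1,s2 = max(itertools.combinations(decomposedsentences, 2), key = lambda pair: len(pair[0][1]&pair[1][1]))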
print("The two sentences with the most common words", sentences[s1[0]], sentences[s2[0]])
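Note that `strip(".?!,")` only trims punctuation from the two ends of the whole sentence, so a word such as `However,` in the middle keeps its comma, and `This` and `this` count as different words. A hedged variant that normalises words first might look like the following. The helper names `words` and `most_common_pair` and the regular expression are illustrative choices, and the sketch assumes at least two sentences:

```python
import itertools
import re

def words(sentence):
    """Lowercased set of words; the regex drops surrounding punctuation."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def most_common_pair(sentences):
    """Return the two sentences sharing the most distinct words."""
    word_sets = [words(s) for s in sentences]
    # max() raises ValueError if there are fewer than two sentences.
    i, j = max(
        itertools.combinations(range(len(sentences)), 2),
        key=lambda pair: len(word_sets[pair[0]] & word_sets[pair[1]]),
    )
    return sentences[i], sentences[j]

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
print(most_common_pair(sentences))
```

For these three sentences the first and third are selected, since they share both `is` and `this`.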


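Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.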

 