
How to find which sentences have the most words in common?

Let us say I have a paragraph. I separate it into sentences with sent_tokenize:

variable = ['By the 1870s the scientific community and much of the general public had accepted evolution as a fact.',
    'However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.',
    'Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.']

Now I split each sentence into words and append the result to some variable. How can I find the two sentences that have the most words in common? I am not sure how to do this. If I have 10 sentences, then I will have 90 checks (between each pair of sentences). Thanks.

You can use the intersection of Python sets.

If you have three sentences as such:

a = "a b c d"
b = "a c x y"
c = "a q v"

You can check how many of the same words occur in two sentences by doing:

sameWords = set.intersection(set(a.split(" ")), set(c.split(" ")))
numberOfWords = len(sameWords)
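
As a small check of the idea, the same count can be written with the `&` operator on sets. This is only a sketch: the helper name `shared_words` is made up for illustration, and it uses `split()` with no argument so that repeated spaces do not produce empty "words".

```python
def shared_words(s1, s2):
    """Return the set of words that appear in both strings."""
    # split() with no argument splits on any run of whitespace
    return set(s1.split()) & set(s2.split())

a = "a b c d"
b = "a c x y"
c = "a q v"

print(shared_words(a, b))       # words common to a and b
print(len(shared_words(a, c)))  # number of words common to a and c
```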

With this you can iterate over your list of sentences, and find the two with the most sameWords in them. This gives us:

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]

def similar(s1, s2):
    sameWords = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    return len(sameWords)

currentSimilar = 0
s1 = ""
s2 = ""

for i, sentence in enumerate(sentences):
    # Compare each unordered pair once; slicing from i + 1 also skips self-comparison
    for sentence2 in sentences[i + 1:]:
        similarity = similar(sentence, sentence2)
        if similarity > currentSimilar:
            s1 = sentence
            s2 = sentence2
            currentSimilar = similarity

print(s1, s2)

There might be a dynamic programming solution to this problem if performance is an issue.
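
One possible way to reduce the work when most pairs of sentences share no words at all is an inverted index (a map from each word to the sentences containing it) rather than dynamic programming. The following is a sketch under that assumption, not a benchmarked solution, and the function name `most_similar_pair` is made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def most_similar_pair(sentences):
    """Return ((i, j), count) for the two sentences sharing the most distinct words."""
    # Map each word to the indices of the sentences that contain it
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].add(i)

    # Every word adds one to the count of each pair of sentences containing it
    shared = Counter()
    for indices in index.values():
        for pair in combinations(sorted(indices), 2):
            shared[pair] += 1

    if not shared:
        return None  # no two sentences share a word
    return shared.most_common(1)[0]

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]
pair, count = most_similar_pair(sentences)
print(sentences[pair[0]], "|", sentences[pair[1]], "|", count)
```

Only pairs that actually share a word are ever touched. If some word occurs in every sentence, the work is still quadratic in the number of sentences.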

import itertools

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
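# Pair each sentence's index with its set of words (punctuation is stripped from the ends of the sentence only)
decomposedsentences = ((index, set(sentence.strip(".?!,").split(" "))) for index, sentence in enumerate(sentences))
# Pick the two (index, word set) tuples whose word sets overlap the most
s1,s2 = max(itertools.combinations(decomposedsentences, 2), key = lambda pair: len(pair[0][1]&pair[1][1]))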
print("The two sentences with the most common words", sentences[s1[0]], sentences[s2[0]])
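
Note that `strip(".?!,")` only removes punctuation at the two ends of the whole sentence, so a comma in the middle stays attached to its word, and "This" and "this" count as different words. A hedged variant follows, assuming words can be approximated by the regular expression `[a-z']+` after lowercasing. That tokenization rule and the helper name `words` are assumptions, not part of the answer above.

```python
import itertools
import re

def words(sentence):
    """Lowercase the sentence and return its set of words (letters and apostrophes)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
word_sets = [words(s) for s in sentences]

# Compare every unordered pair of sentence indices once
i, j = max(itertools.combinations(range(len(sentences)), 2),
           key=lambda pair: len(word_sets[pair[0]] & word_sets[pair[1]]))
print("The two sentences with the most words in common:", sentences[i], sentences[j])
```

Working with indices instead of the sentences themselves keeps duplicate sentences distinguishable and makes it easy to look up the original text afterwards.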


 