
Python: how to find the closest matching sentence from a txt file

I want to print any sentence in a .txt file that is similar to another sentence in the same file.

Example:
If the .txt file contains:

1 . What is the biggest planet of our Solar system?
2 . How to make tea?
3 . Which our Solar system's biggest planet?

In this case, the output should be:
3 . Which our Solar system's biggest planet?

Basically, it should check whether any lines in the file have more than 4 or 5 words in common.

I agree with John Coleman's suggestion: difflib can help you compute a similarity metric between two strings. Here's one possible approach:

from difflib import SequenceMatcher

sentences = []
with open('./bp.txt', 'r') as f:
    for line in f:
        # Only consider lines that start with a number followed by a period,
        # e.g. "3 . Which ..."; strip() tolerates a space before the period.
        if line.split('.')[0].strip().isdigit():
            sentences.append(line.rstrip('\n'))

# Compare every pair of sentences and remember the later sentence of the
# pair with the highest similarity ratio (0.0 = no match, 1.0 = identical).
max_ratio = 0
similar_sentence = None
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        match_ratio = SequenceMatcher(None, sentences[i], sentences[j]).ratio()
        if match_ratio > max_ratio:
            max_ratio = match_ratio
            similar_sentence = sentences[j]

if similar_sentence is not None:
    print(similar_sentence)
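
Since the question asks specifically about lines sharing 4 or 5 words, a word-overlap count may be closer to that requirement than a character-level ratio. Below is a rough sketch of that idea, not part of the approach above. The helper names words and pairs_sharing_words, and the default threshold min_shared=4, are my own assumptions; adjust them to taste.

```python
import re
from itertools import combinations

def words(sentence):
    """Return the set of lowercase alphabetic words in a sentence.

    Digits and punctuation are dropped, so "system's" becomes
    the two tokens "system" and "s".
    """
    return set(re.findall(r"[a-z]+", sentence.lower()))

def pairs_sharing_words(sentences, min_shared=4):
    """Return (earlier, later, shared_count) for every pair of sentences
    that has at least min_shared distinct words in common."""
    result = []
    for a, b in combinations(sentences, 2):
        shared = words(a) & words(b)
        if len(shared) >= min_shared:
            result.append((a, b, len(shared)))
    return result

# The three lines from the question, used here instead of reading a file.
sample = [
    "1 . What is the biggest planet of our Solar system?",
    "2 . How to make tea?",
    "3 . Which our Solar system's biggest planet?",
]

for earlier, later, count in pairs_sharing_words(sample):
    print(later)
```

With the sample lines, sentences 1 and 3 share five words (biggest, planet, our, solar, system), so line 3 is printed. Note that this ignores word order and counts common words such as "the" like any other, so you may want to filter out stop words for real data.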

