
python - find matching sentences in file

I have a text file that contains 35k words in paragraphs. A sample is below:

This sentence does repeat? This sentence does not repeat! This sentence does not repeat. This sentence does repeat.
This sentence does repeat. This sentence does not repeat! This sentence does not repeat. This sentence does repeat!

I want to identify matching sentences. One way I found is to split the paragraphs into separate lines, using ., !, ? etc. as delimiters, and then look for matching lines.

Code

import collections as col

with open('txt.txt', 'r') as f:
    l = f.read().replace('. ','.\n').replace('? ','?\n').replace('! ','!\n').splitlines()
print([i for i, n in col.Counter(l).items() if n > 1])

Please suggest some better approaches.

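You can use re.split:

import re
...
# use + rather than *: a pattern that can match the empty string makes
# re.split break the text between every character on Python 3.7+
l = re.split(r'[?!.]+', f.read())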
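Splitting on the punctuation characters themselves drops them from the result, so "This sentence does repeat?" and "This sentence does repeat." would be treated as the same sentence. If the final mark matters, one option is to split on the whitespace that follows it instead. This is only a minimal sketch: the function name repeated_sentences is made up for illustration, and it assumes the whole text fits in a string.

```python
import re
from collections import Counter

def repeated_sentences(text):
    # Split on whitespace that follows ., ! or ? so the mark stays attached.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    counts = Counter(s for s in sentences if s)
    # Keep only the sentences seen more than once.
    return [s for s, n in counts.items() if n > 1]

sample = ("This sentence does repeat? This sentence does not repeat! "
          "This sentence does not repeat. This sentence does repeat.\n"
          "This sentence does repeat. This sentence does not repeat! "
          "This sentence does not repeat. This sentence does repeat!")
print(repeated_sentences(sample))
```

On the sample from the question this should report three repeated sentences, with "does repeat?" and "does repeat!" left out because each occurs only once.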

I cannot guarantee it would be the fastest, but I would try to exploit the speed of sort. First I would split the text by punctuation to get a list of sentences, then sort the list so that identical sentences end up next to each other, and finally loop through the list, count each run of consecutive identical sentences, and store the sentence and its count in a dict.
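The sort-based idea could look roughly like the sketch below. The function name count_by_sorting and the splitting regex are assumptions for illustration, not part of the answer. itertools.groupby does the "count consecutive equal items" step.

```python
import re
from itertools import groupby

def count_by_sorting(text):
    # Split after ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    sentences.sort()  # identical sentences are now adjacent
    counts = {}
    for sentence, run in groupby(sentences):
        # Each run holds one group of consecutive identical sentences.
        counts[sentence] = sum(1 for _ in run)
    return counts

counts = count_by_sorting("A b. C d! A b. A b? C d!")
print({s: n for s, n in counts.items() if n > 1})
```

Sorting costs O(n log n), while a hash-based count such as collections.Counter is roughly linear, so for 35k words the Counter version is probably at least as fast. It is worth timing both on the real file.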

You can do it a different way. The re module is very powerful:

import re
from collections import Counter

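pat = r'[?.!]'
c = Counter()
with open('filename') as f:
    for line in f:
        # replace each ?, . or ! with a newline, then count every sentence
        for s in re.sub(pat, '\n', line).splitlines():
            if s.strip():
                c[s.strip()] += 1

This creates a regex pattern matching ?, . or ! and replaces each match with \n, so every sentence ends up on its own line and can be counted. Using the for loop, this happens line by line. Because the punctuation is replaced, sentences that differ only in their final mark are counted together.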
