
Python: Finding and counting exact and approximate matches of words in txt file

My program is close to doing what I want, but I have one hangup: many of the keywords I'm trying to find may have symbols in the middle or may be misspelled. I would therefore like to count the misspelled words as keyword matches, as if they were spelled correctly. For example, let's say my text says: "settlement settl#7*nt se##tl#ment ann&&ity annuity."

I want to count the times the .txt file contains the keywords "settlement" and "annuity", but also count words that begin with "sett" and end with "nt" as "settlement", and words that begin with "ann" and end with "y" as "annuity".

I've been able to count exact words, which gets pretty close to what I want. But now I would like to do the approximate matches, and I'm not even sure that's possible. Thanks.

import glob
import os
import sys

out1 = open("seen.txt", "w")
out2 = open("missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    # Use the directory passed in rather than a hard-coded "/Settlement" path.
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key in words:
                # Count exact (substring) occurrences of each keyword.
                words[key] = data.count(key)
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

dirpath = sys.argv[1]
keys = ["annuity", "settlement"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(dirpath, words, action=print_summary)

out1.close()
out2.close()
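For what it's worth, the prefix/suffix rule from the question can be written directly as a standard-library regular expression. Below is a minimal sketch, assuming the corrupted characters are non-space symbols; the patterns and names are illustrative, not part of the original post:

import re

# Prefix/suffix rule from the question: "sett...nt" counts as "settlement",
# "ann...y" counts as "annuity". \S* allows any non-space junk in between.
patterns = {
    "settlement": re.compile(r"sett\S*nt"),
    "annuity": re.compile(r"ann\S*y"),
}

text = "settlement settl#7*nt se##tl#ment ann&&ity annuity"
for keyword, pat in patterns.items():
    print(keyword, len(pat.findall(text)))
# settlement 2, annuity 2 -- "se##tl#ment" is missed because its prefix itself
# is corrupted, which is why the edit-distance approach below is more robust.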

For fuzzy matching you can use the regex module; install it once with the pip install regex command.

With this regex module you can use any expression, and with a suffix like {e<=2} you can specify the number of errors allowed for a word to match the regular expression (one error is a substitution, insertion, or deletion of a single symbol). This error count is also known as the edit distance or Levenshtein distance.
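For example, here is a quick check of the fuzzy syntax, with error budgets chosen to match the misspellings from the question:

import regex

# "settl#7*nt" differs from "settlement" by 3 substitutions,
# so it matches with an error budget of 3 but not with 2.
print(bool(regex.fullmatch(r'(settlement){e<=3}', 'settl#7*nt')))  # True
print(bool(regex.fullmatch(r'(settlement){e<=2}', 'settl#7*nt')))  # False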

As an example I wrote my own function for counting words inside a given string. The function has a num_errors parameter that specifies how many errors are acceptable for a given word to match. I used num_errors = 3; you can raise it, but don't set it too high, otherwise any word in the text will match any reference word.

To split the sentence into words I used re.split().
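Note that splitting this way leaves a trailing empty string when the text ends with a delimiter; this is harmless here, because the empty string is never within num_errors edits of either keyword:

import regex as re

print(re.split(r'[,.\s]+', 'settlement ann&&ity annuity.'))
# ['settlement', 'ann&&ity', 'annuity', '']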

Try it online!

import regex as re
def count_words(text, words, *, num_errors=3):
    # One fuzzy pattern per reference word, allowing up to num_errors edits.
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e: 0 for e in words}
    for wt in re.split(r'[,.\s]+', text):
        for wre, wrt in zip(we, words):
            if re.fullmatch(wre, wt):
                cnt[wrt] += 1
                break  # each text word counts for at most one reference word
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As a faster alternative to the regex module you can use the Levenshtein module; install it once with the pip install python-Levenshtein command.

This module implements only the edit distance (mentioned above) and should work much faster than the regex module.
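A quick sanity check of the distance function on the misspellings from the question:

import Levenshtein

print(Levenshtein.distance('settlement', 'settl#7*nt'))  # 3 (three substitutions)
print(Levenshtein.distance('annuity', 'ann&&ity'))       # 2 (one substitution + one insertion)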

The same code as above, implemented with the Levenshtein module, is below:

Try it online!

import Levenshtein, re
def count_words(text, words, *, num_errors=3):
    cnt = {e: 0 for e in words}
    for wt in re.split(r'[,.\s]+', text):
        for wr in words:
            # Edit distance replaces the fuzzy regex from the previous version.
            if Levenshtein.distance(wr, wt) <= num_errors:
                cnt[wr] += 1
                break
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As requested by the OP, I'm adding a third algorithm that doesn't use re.split() to split the text into words, but uses re.finditer() instead.

Try it online!

import regex as re
def count_words(text, words, *, num_errors=3):
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e: 0 for e in words}
    for wre, wrt in zip(we, words):
        # finditer() scans the whole text for non-overlapping fuzzy matches.
        cnt[wrt] += len(list(re.finditer(wre, text)))
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}
