
Python: Finding and counting exact and approximate matches of words in txt file

My program is close to doing what I want it to do, but I have one hang-up: many of the keywords I'm trying to find might have symbols in the middle or might be misspelled. I would therefore like to count the misspelled words as keyword matches, as if they were spelled correctly. For example, let's say my text says: "settlement settl#7*nt se##tl#ment ann&&ity annuity."

I want to count the times the .txt file has the keywords "settlement" and "annuity", but also count words that begin with "sett" and end with "nt" as "settlement", and words that begin with "ann" and end with "y" as "annuity".
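
To illustrate that rule, something like the following with the standard re module is what I have in mind (just a sketch of the prefix/suffix idea, not my actual program):

import re

# Illustrative prefix/suffix patterns: any word starting with "sett" and ending
# with "nt" counts as "settlement"; any word starting with "ann" and ending
# with "y" counts as "annuity".
patterns = {
    "settlement": re.compile(r"\bsett\S*nt\b"),
    "annuity": re.compile(r"\bann\S*y\b"),
}

text = "settlement settl#7*nt se##tl#ment ann&&ity annuity"
print({key: len(pat.findall(text)) for key, pat in patterns.items()})
# {'settlement': 2, 'annuity': 2} -- "se##tl#ment" is missed because it does not
# literally start with "sett", which is why an approximate match is needed.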

I've been able to count exact words, which gets pretty close to what I want. But now I would like to do the approximate matches. I'm not even sure this is possible. Thanks.

import glob
import os
import sys

out1 = open("seen.txt", "w")
out2 = open("missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    # Count exact substring occurrences of each keyword in every .txt file in dirpath.
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                words[key] = data.count(key)
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    # Send keywords that were found to seen.txt and missing ones to missing.txt.
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

dirpath = sys.argv[1]
keys = ["annuity", "settlement"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(dirpath, words, action=print_summary)

out1.close()
out2.close()

For fuzzy matching you can use the regex module; install it once with the pip install regex command.

With this regex module you can use any expression, and with the {e<=2} suffix you can specify the number of errors allowed for a word to still match the regular expression (one error is a substitution, insertion, or deletion of a single symbol). This is also called edit distance or Levenshtein distance.
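
A minimal illustration of that suffix, using the garbled words from the question (just to show the syntax before the full function below):

import regex as re

# "settl#7*nt" differs from "settlement" by 3 edits, so it fails with e<=2
# but matches once up to 3 errors are allowed.
print(bool(re.fullmatch(r'(settlement){e<=2}', 'settl#7*nt')))  # False
print(bool(re.fullmatch(r'(settlement){e<=3}', 'settl#7*nt')))  # True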

As an example I wrote my own function for counting words inside a given string. This function has a num_errors parameter that specifies how many errors are acceptable for a given word to match. I specified num_errors = 3; you can set a higher error rate, but don't set it very high, otherwise any word in the text will match any reference word.

To split the sentence into words I used re.split().
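
For reference, this is what that split produces on the question's sample sentence (punctuation is removed, but the symbols inside the words are kept):

import re

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity.'
print(re.split(r'[,.\s]+', text))
# ['settlement', 'settl#7*nt', 'se##tl#ment', 'ann&&ity', 'annuity', '']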


import regex as re

def count_words(text, words, *, num_errors = 3):
    # Build one fuzzy pattern per reference word, allowing up to num_errors errors.
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e : 0 for e in words}
    # Split the text into words and test each one against every fuzzy pattern.
    for wt in re.split(r'[,.\s]+', text):
        for wre, wrt in zip(we, words):
            if re.fullmatch(wre, wt):
                cnt[wrt] += 1
                break
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As a faster alternative to the regex module you can use the Levenshtein module; install it once with the pip install python-Levenshtein command.

This module implements only the edit distance (mentioned above) and should work much faster than the regex module.
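
For intuition, these are the edit distances for the words from the question, which is why num_errors = 3 is enough here:

import Levenshtein

print(Levenshtein.distance('settlement', 'settl#7*nt'))   # 3
print(Levenshtein.distance('settlement', 'se##tl#ment'))  # 3
print(Levenshtein.distance('annuity', 'ann&&ity'))        # 2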

The same code as above, but implemented using the Levenshtein module, is below:


import Levenshtein, re

def count_words(text, words, *, num_errors = 3):
    cnt = {e : 0 for e in words}
    # Split the text into words and count those within num_errors edits of a reference word.
    for wt in re.split(r'[,.\s]+', text):
        for wr in words:
            if Levenshtein.distance(wr, wt) <= num_errors:
                cnt[wr] += 1
                break
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As requested by the OP, here is a third algorithm that doesn't use re.split() to split the text into words, but uses re.finditer() instead.


import regex as re

def count_words(text, words, *, num_errors = 3):
    # Build one fuzzy pattern per reference word, allowing up to num_errors errors.
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e : 0 for e in words}
    # Count non-overlapping fuzzy matches of each pattern anywhere in the text.
    for wre, wrt in zip(we, words):
        cnt[wrt] += len(list(re.finditer(wre, text)))
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}
