Python - Finding word frequencies and string frequency with possible misspellings and saving as txt file or CSV

What I'm trying to do is scan pretty messy text files for specific words and phrases that are sometimes misspelled or contain characters that don't belong. So far I can count single words with exact spellings across multiple files in a directory, which is close, but not exactly what I'm looking for. I would also like to save the counts of words and phrases to a text file (or CSV) instead of just printing a summary, which is what my code does now.

Finding close matches would be ideal, but it's okay if that isn't possible.

Thanks for your help.

import glob
import os
from collections import Counter


def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 so words that never appear are still reported
    ct = Counter({w: 0 for w in words})
    file_words = (word for line in fileobj for word in line.split())
    # exact, whitespace-delimited matches only: a two-word phrase such as
    # "policy payment" or a misspelled word is never counted here
    ct.update(word for word in file_words if word in words)
    return ct


def count_words_in_dir(dirpath, words, action=None):
    """For each .txt file in a dir, count the specified words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            ct = word_frequency(f, words)
            if action:
                action(filepath, ct)


def print_summary(filepath, ct):
    """Print the file path, then the words and their counts on two lines"""
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    print('{0}\n{1}\n{2}\n\n'.format(
        filepath,
        ', '.join(words),
        ', '.join(counts)))


words = {'JUSTICE', "policy payment", "payment", "annuity", "CYNTHEA"}
count_words_in_dir('./', words, action=print_summary)
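
For the close-match part of the question, one possible approach (not something the original code does) is to compare every word, or every run of consecutive words for a phrase, against each target with `difflib.SequenceMatcher` from the standard library. This is only a minimal sketch: the function name `fuzzy_count`, the letters-only tokenizer, and the `0.8` similarity cutoff are assumptions for illustration and would need tuning on real files.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def fuzzy_count(text, targets, cutoff=0.8):
    """Count words (or word runs, for phrases) that closely match each target.

    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    # Lowercase and keep only runs of letters/apostrophes as tokens
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter({t: 0 for t in targets})
    for target in targets:
        wanted = target.lower()
        n = len(wanted.split())  # number of words in the target
        # Slide a window of n tokens over the text and score each window
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            if SequenceMatcher(None, wanted, window).ratio() >= cutoff:
                counts[target] += 1
    return counts


if __name__ == "__main__":
    sample = "JUSTlCE polcy payment and anuity for Cynthea"
    targets = ["JUSTICE", "policy payment", "payment", "annuity", "CYNTHEA"]
    for target, n in fuzzy_count(sample, targets).items():
        print(target, n)
```

Because every window is scored against every target, this is slower than exact matching, and a lower cutoff will start counting unrelated words. A single misspelled word can also be counted for more than one target if it is close to both.
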

# Second version: counts substring occurrences in each file's full text
# (so multi-word phrases work) and redirects the printed summary to log.txt
import sys
import os
import glob

# send everything printed below to log.txt instead of the console
stdoutOrigin = sys.stdout
sys.stdout = open("log.txt", "w")


def count_words_in_dir(dirpath, words, action=None):
    """For each .txt file in dirpath, count occurrences of each key in words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key in words:
                # str.count matches exact substrings, so phrases are found,
                # but misspellings are not
                words[key] = data.count(key)
            if action:
                action(filepath, words)


def print_summary(filepath, words):
    print(filepath)
    for key, val in sorted(words.items()):
        print('{0}:\t{1}'.format(
            key,
            val))


# directory to scan, given as the first command-line argument
filepath = sys.argv[1]
# placeholders: replace with the words and phrases to count
keys = ["keyword",
"keyword"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(filepath, words, action=print_summary)

# restore the console output
sys.stdout.close()
sys.stdout = stdoutOrigin
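
If a CSV is preferred over redirecting `print` output to `log.txt`, the standard `csv` module can write one row per file and phrase. The following is a sketch under assumptions rather than part of the code above: the function names `count_phrases` and `write_counts_csv`, the column names, the case-insensitive matching, and the UTF-8 encoding are all illustrative choices.

```python
import csv
import glob
import os


def count_phrases(text, phrases):
    """Return {phrase: number of case-insensitive substring occurrences}."""
    lowered = text.lower()
    return {p: lowered.count(p.lower()) for p in phrases}


def write_counts_csv(dirpath, phrases, out_path):
    """Write one CSV row per (file, phrase) pair found under dirpath."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "phrase", "count"])  # assumed header names
        for filepath in sorted(glob.glob(os.path.join(dirpath, "*.txt"))):
            # errors="replace" keeps going on messy files with bad bytes
            with open(filepath, encoding="utf-8", errors="replace") as f:
                counts = count_phrases(f.read(), phrases)
            for phrase in phrases:
                writer.writerow(
                    [os.path.basename(filepath), phrase, counts[phrase]])


if __name__ == "__main__":
    write_counts_csv("./", ["policy payment", "payment", "annuity"],
                     "counts.csv")
```

Note that substring counting means "payment" is also counted inside "policy payment", so the two rows overlap. The output file should not end in `.txt` inside the scanned directory, or it would be picked up as input on the next run.
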
