How do I make each line in a text file its own dictionary to sort through in Python?
Currently, I have:
```python
import re
import string

input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')

stopwords_list = []
for line in stopwords_file.readlines():
    stopwords_list.extend(line.split())
stopwords_set = set(stopwords_list)

word_count = {}
for line in input_file.readlines():
    words = line.strip()
    words = words.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\w+', words)
    for word in words:
        if word.lower() in stopwords_set:
            continue
        word = word.lower()
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])
```
What it does is parse through a txt file I have, remove stopwords, and output the number of times a word appears in the document it is reading from.
The problem is that the txt file contains not one document, but five.
The text in the document looks something like this:
1
The cat in the hat was on the mat
2
The rat on the mat sat
3
The bat was fat and named Pat
Each "document" is a line preceded by the document ID number.
In Python, I want to find a way to go through documents 1, 2, and 3 and count how many times a word appears in an individual document, as well as the total number of times a word appears in the whole text file - which my code currently does.
i.e. "mat" appears 2 times in the text file: it appears in Document 1 and Document 2. Ideally something less wordy than that.
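For reference, the kind of result I'm after could be sketched as a nested mapping (hypothetical names and shape, just to illustrate the desired output, not working code):

```python
# Hypothetical target structure: an overall total plus per-document counts.
counts = {
    "mat": {"total": 2, "per_doc": {1: 1, 2: 1}},  # "mat": once in doc 1, once in doc 2
}

print(counts["mat"]["total"])            # 2
print(sorted(counts["mat"]["per_doc"]))  # [1, 2]
```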
Give this a try:
```python
import string

def count_words(file_name):
    word_count = {}
    doc_id = None
    with open(file_name, 'r') as input_file:
        for line in input_file:
            line = line.strip()
            # An ID line contains nothing but the document number.
            if line.isdigit():
                doc_id = line
                continue
            for word in line.split():
                word = word.translate(str.maketrans('', '', string.punctuation)).lower()
                if not word:
                    continue
                if word in word_count:
                    word_count[word][doc_id] = word_count[word].get(doc_id, 0) + 1
                else:
                    word_count[word] = {doc_id: 1}
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")
```
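If the overall count across the whole file is also wanted, it can be derived from the nested `{word: {doc_id: count}}` mapping returned above (a small sketch on top of that structure; `overall_counts` and the sample data are mine, not from the answer):

```python
# Sketch: derive overall counts from a nested {word: {doc_id: count}} mapping.
def overall_counts(word_count):
    return {word: sum(per_doc.values()) for word, per_doc in word_count.items()}

sample = {"mat": {"1": 1, "2": 1}, "cat": {"1": 1}}
print(overall_counts(sample))  # {'mat': 2, 'cat': 1}
```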
You have deleted your previous similar question and with it my answer, so I'm not sure if it's a good idea to answer again. I'll give a slightly different answer, without `groupby`, although I think it was fine.
You could try:
```python
import re
from collections import Counter
from string import punctuation

with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")

with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}
```
The regex `r"(\d+)\s*$"`, used with `match`, looks for digits at the beginning of a line and nothing else, except maybe some whitespace, until the line break. You have to adjust it if the identifier follows a different logic.

`word_count` records each occurrence of a word as a list of the numbers of the documents it appears in. `word_count_overall` just takes the length of the respective lists to get the overall count of a word. `word_count_docs` applies a `Counter` to the lists to get the counts per document for each word.
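For the sample documents from the question (and ignoring stopword removal for brevity), the two resulting mappings look roughly like this (a hand-written sketch of the same structures, not the output of an actual run):

```python
from collections import Counter

# Same structure as in the answer: each word maps to a list of document numbers,
# one entry per occurrence.
word_count = {
    "mat": [1, 2],  # "mat" occurs once in doc 1 and once in doc 2
    "sat": [2],
    "fat": [3],
}

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

print(word_count_overall["mat"])  # 2
print(word_count_docs["mat"])     # Counter({1: 1, 2: 1})
```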