How do I make each line in a text file its own dictionary to sort through in Python?

Currently, I have:

import re
import string

# Collect the stopwords, which are separated by whitespace in the file.
stopwords_list = []
with open('stopwords_en.txt', 'r') as stopwords_file:
    for line in stopwords_file:
        stopwords_list.extend(line.split())

stopwords_set = set(stopwords_list)

word_count = {}
with open('documents.txt', 'r') as input_file:
    for line in input_file:
        words = line.strip()
        words = words.translate(str.maketrans('', '', string.punctuation))
        # Search the cleaned text, not the raw line.
        words = re.findall(r'\w+', words)
        for word in words:
            word = word.lower()
            if word in stopwords_set:
                continue
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] = word_count[word] + 1

# Print the words alphabetically with their counts.
word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])

It parses a txt file I have, removes stopwords, and outputs the number of times each word appears in the file it reads.

The problem is that the txt file does not hold one document, but five.

The text in the file looks something like this:

1 
The cat in the hat was on the mat

2 
The rat on the mat sat

3
The bat was fat and named Pat

Each "document" is a line preceded by the document ID number.

In Python, I want to find a way to go through documents 1, 2, and 3 and count how many times a word appears in each individual document, as well as the total number of times it appears in the whole text file, which is what my code currently does.

For example: "mat" appears 2 times in the text file, in Document 1 and Document 2. Ideally the output would be less wordy than that.

Give this a try:

import string

def count_words(file_name):
    # Maps each word to a dict of {document ID: count in that document}.
    word_count = {}
    # Build the punctuation-stripping table once, not once per word.
    table = str.maketrans('', '', string.punctuation)
    doc_id = None
    with open(file_name, 'r') as input_file:
        for line in input_file:
            line = line.strip()
            if line.isdigit():
                # A line holding only a number starts a new document.
                doc_id = line
                continue
            for word in line.split():
                word = word.translate(table).lower()
                if not word:
                    continue
                counts = word_count.setdefault(word, {})
                counts[doc_id] = counts.get(doc_id, 0) + 1
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")
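
The dictionary returned by count_words maps each word to its per-document counts, so the total for a word is just the sum of the inner values. A minimal sketch of that step, assuming the shape {word: {doc_id: count}}; the hand-built sample dictionary and the helper name summarize are made up for illustration and are not part of the answer above:

```python
# Hypothetical sample in the shape {word: {doc_id: count}},
# hand-built from the three example documents in the question.
sample = {
    "mat": {"1": 1, "2": 1},
    "the": {"1": 3, "2": 2, "3": 1},
    "bat": {"3": 1},
}

def summarize(word_count):
    """Return {word: (total count, sorted list of document IDs)}."""
    return {
        word: (sum(per_doc.values()), sorted(per_doc))
        for word, per_doc in word_count.items()
    }

# Print one compact line per word, in alphabetical order.
for word, (total, docs) in sorted(summarize(sample).items()):
    print(f"{word}: {total} time(s), in document(s) {', '.join(docs)}")
```

This keeps a single data structure and derives the overall count on demand, which may be enough if the file is small.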

You have deleted your previous similar question and with it my answer, so I'm not sure if it's a good idea to answer again. I'll give a slightly different answer, without groupby, although I think that one was fine.

You could try:

import re
from collections import Counter
from string import punctuation

with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))
translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")
with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

  • I would build the translation table only once, beforehand, rather than again for each line.
  • The regex that identifies a new document, (\d+)\s*$, looks for digits at the beginning of a line and nothing else, except perhaps some whitespace, up to the line break. You have to adjust it if the identifier follows a different pattern.
  • word_count records each occurrence of a word by appending the number of the current document to that word's list.
  • word_count_overall just takes the length of the respective list to get the overall count of a word.
  • word_count_docs applies a Counter to each list to get the per-document counts for each word.
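
To see what the two result dictionaries look like, here is a minimal self-contained sketch that runs the same steps on the sample text from the question. It assumes the IDs sit on their own lines as in that sample; the in-memory io.StringIO object and the small stopword set are stand-ins for documents.txt and stopwords_en.txt, chosen only for illustration:

```python
import io
import re
from collections import Counter
from string import punctuation

# Stand-ins for the real files (assumptions for illustration only).
stopwords = {"the", "in", "on", "was", "and"}
sample = io.StringIO(
    "1 \nThe cat in the hat was on the mat\n\n"
    "2 \nThe rat on the mat sat\n\n"
    "3\nThe bat was fat and named Pat\n"
)

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")

word_count, doc_no = {}, 0
for line in sample:
    match = re_new_doc.match(line)
    if match:
        # A digits-only line switches to a new document.
        doc_no = int(match[1])
        continue
    for word in re.findall(r"\w+", line.translate(translation)):
        word = word.casefold()
        if word not in stopwords:
            # Remember which document this occurrence came from.
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

print(word_count_overall["mat"])      # 2
print(dict(word_count_docs["mat"]))   # {1: 1, 2: 1}
```

Because each occurrence is stored with its document number, both the overall and the per-document counts come from the same list.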

