How do I make each line in a text file its own dictionary to sort through in Python?
Currently, I have:
```python
import re
import string

input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')

stopwords_list = []
for line in stopwords_file.readlines():
    stopwords_list.extend(line.split())
stopwords_set = set(stopwords_list)

word_count = {}
for line in input_file.readlines():
    words = line.strip()
    words = words.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\w+', words)
    for word in words:
        if word.lower() in stopwords_set:
            continue
        word = word.lower()
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])
```
What it does is parse through a txt file I have, remove stopwords, and output the number of times a word appears in the document it is reading from.
The problem is that the txt file contains not one document, but five.
The text in the document looks something like this:
1
The cat in the hat was on the mat
2
The rat on the mat sat
3
The bat was fat and named Pat
Each "document" is a line preceded by the document ID number.
In Python, I want to find a way to go through documents 1, 2, and 3 and count how many times a word appears in an individual document, as well as the total number of times a word appears in the whole text file - which my code currently does.
i.e. "mat" appears 2 times in the text file: it appears in Document 1 and Document 2. Ideally something less wordy than that.
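For reference, the kind of result I'm after could be sketched as a nested mapping (hypothetical names and shape, just to illustrate the desired output, not working code):

```python
# Hypothetical target structure: an overall total plus per-document counts.
counts = {
    "mat": {"total": 2, "per_doc": {1: 1, 2: 1}},  # "mat": once in doc 1, once in doc 2
}

print(counts["mat"]["total"])            # 2
print(sorted(counts["mat"]["per_doc"]))  # [1, 2]
```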
Give this a try:
```python
import string

def count_words(file_name):
    word_count = {}
    doc_id = None
    with open(file_name, 'r') as input_file:
        for line in input_file:
            line = line.strip()
            # An ID line contains nothing but the document number.
            if line.isdigit():
                doc_id = line
                continue
            for word in line.split():
                word = word.translate(str.maketrans('', '', string.punctuation)).lower()
                if not word:
                    continue
                if word in word_count:
                    word_count[word][doc_id] = word_count[word].get(doc_id, 0) + 1
                else:
                    word_count[word] = {doc_id: 1}
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")
```
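If the overall count across the whole file is also wanted, it can be derived from the nested `{word: {doc_id: count}}` mapping returned above (a small sketch on top of that structure; `overall_counts` and the sample data are mine, not from the answer):

```python
# Sketch: derive overall counts from a nested {word: {doc_id: count}} mapping.
def overall_counts(word_count):
    return {word: sum(per_doc.values()) for word, per_doc in word_count.items()}

sample = {"mat": {"1": 1, "2": 1}, "cat": {"1": 1}}
print(overall_counts(sample))  # {'mat': 2, 'cat': 1}
```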
You have deleted your previous similar question and with it my answer, so I'm not sure if it's a good idea to answer again. I'll give a slightly different answer, without `groupby`, although I think it was fine.
You could try:
```python
import re
from collections import Counter
from string import punctuation

with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")

with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}
```
The regex `r"(\d+)\s*$"`, used with `match`, looks for digits at the beginning of a line and nothing else, except maybe some whitespace, until the line break. You have to adjust it if the identifier follows a different logic.

`word_count` records each occurrence of a word as a list of the numbers of the documents it appears in. `word_count_overall` just takes the length of the respective lists to get the overall count of a word. `word_count_docs` applies a `Counter` to the lists to get the counts per document for each word.
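For the sample documents from the question (and ignoring stopword removal for brevity), the two resulting mappings look roughly like this (a hand-written sketch of the same structures, not the output of an actual run):

```python
from collections import Counter

# Same structure as in the answer: each word maps to a list of document numbers,
# one entry per occurrence.
word_count = {
    "mat": [1, 2],  # "mat" occurs once in doc 1 and once in doc 2
    "sat": [2],
    "fat": [3],
}

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

print(word_count_overall["mat"])  # 2
print(word_count_docs["mat"])     # Counter({1: 1, 2: 1})
```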