How do I make each line in a text file its own dictionary to sort through in Python?
Currently, I have:
import re
import string

input_file = open('documents.txt', 'r')
stopwords_file = open('stopwords_en.txt', 'r')

stopwords_list = []
for line in stopwords_file.readlines():
    stopwords_list.extend(line.split())
stopwords_set = set(stopwords_list)

word_count = {}
for line in input_file.readlines():
    words = line.strip()
    words = words.translate(str.maketrans('','', string.punctuation))
    words = re.findall(r'\w+', line)
    for word in words:
        if word.lower() in stopwords_set:
            continue
        word = word.lower()
        if not word in word_count:
            word_count[word] = 1
        else:
            word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
    print (word, word_count[word])
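What it does is parse a txt file I have, remove the stop words, and output the number of times each word appears in the document it is reading.
The problem is that the txt file is not one document but five.
The text in the file looks like this: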
1
The cat in the hat was on the mat
2
The rat on the mat sat
3
The bat was fat and named Pat
Each "document" is a line preceded by the document ID number.
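In Python, I want to find a way to go through documents 1, 2, 3 and count how many times a word appears in each individual document, as well as the total number of times a word appears in the whole text file, which is what my code currently does.
For example: "mat" appears 2 times in the text file. It appears in Document 1 and Document 2. Ideally the output would be less wordy than that.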
Try this:
import string

def count_words(file_name):
    # Maps each word to a {doc_id: count} dictionary.
    word_count = {}
    doc_id = None
    table = str.maketrans('', '', string.punctuation)
    with open(file_name, 'r') as input_file:
        for line in input_file:
            line = line.strip()
            if line.isdigit():
                # A line holding only a number starts a new document.
                doc_id = line
                continue
            for word in line.split():
                word = word.translate(table).lower()
                if not word:
                    continue
                if word in word_count:
                    word_count[word][doc_id] = word_count[word].get(doc_id, 0) + 1
                else:
                    word_count[word] = {doc_id: 1}
    return word_count

word_count = count_words("documents.txt")
for word, doc_count in word_count.items():
    print(f"{word} appears in: {doc_count}")
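The question also asks for the overall total per word. With a result shaped like {word: {doc_id: count}}, that total can be derived by summing the per-document counts. A minimal sketch, assuming that shape; the helper name total_counts is hypothetical and the sample dictionary is hand-written from the three example documents:

```python
def total_counts(word_count):
    """Sum the per-document counts of each word into one overall total."""
    return {word: sum(doc_count.values()) for word, doc_count in word_count.items()}

# Hand-written sample in the shape {word: {doc_id: count}}.
sample = {
    "mat": {"1": 1, "2": 1},
    "the": {"1": 3, "2": 2, "3": 1},
}

totals = total_counts(sample)
for word in sorted(totals):
    docs = ", ".join(sorted(sample[word]))
    print(f"{word}: {totals[word]} total, in documents {docs}")
```

Keeping the per-document counts as the primary structure and deriving the totals from it avoids maintaining two dictionaries while reading the file.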
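You have deleted your previous, similar question along with my answer, so I'm not sure whether answering again is a good idea. I'll give a slightly different answer, without groupby, although I think that approach is fine.
You could try: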
import re
from collections import Counter
from string import punctuation

with open("stopwords_en.txt", "r") as file:
    stopwords = set().union(*(line.rstrip().split() for line in file))

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")
with open("documents.txt", "r") as file:
    word_count, doc_no = {}, 0
    for line in file:
        match = re_new_doc.match(line)
        if match:
            doc_no = int(match[1])
            continue
        line = line.translate(translation)
        for word in re.findall(r"\w+", line):
            word = word.casefold()
            if word in stopwords:
                continue
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}
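The regex (\d+)\s*$, applied with match, looks for digits at the beginning of a line followed by nothing else, except possibly some whitespace, up to the line break. If the identifier follows a different logic, you have to adjust it.
word_count records each occurrence of a word in a list, together with the number of the current document.
word_count_overall just takes the length of the respective list to get the overall count of a word.
word_count_docs applies a Counter to each list to get the per-document counts for each word.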
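As a quick check of this approach, here is a self-contained sketch that runs the same logic on the three sample documents from the question. io.StringIO stands in for documents.txt, and the stop word set is a small hand-picked assumption, not the actual contents of stopwords_en.txt:

```python
import io
import re
from collections import Counter
from string import punctuation

# Stand-in for documents.txt (sample text from the question).
sample = io.StringIO(
    "1\n"
    "The cat in the hat was on the mat\n"
    "2\n"
    "The rat on the mat sat\n"
    "3\n"
    "The bat was fat and named Pat\n"
)
# Assumed stop words; the real list would come from stopwords_en.txt.
stopwords = {"the", "in", "was", "on", "and"}

translation = str.maketrans("", "", punctuation)
re_new_doc = re.compile(r"(\d+)\s*$")

word_count, doc_no = {}, 0
for line in sample:
    match = re_new_doc.match(line)
    if match:
        # A line holding only a number starts a new document.
        doc_no = int(match[1])
        continue
    for word in re.findall(r"\w+", line.translate(translation)):
        word = word.casefold()
        if word not in stopwords:
            word_count.setdefault(word, []).append(doc_no)

word_count_overall = {word: len(docs) for word, docs in word_count.items()}
word_count_docs = {word: Counter(docs) for word, docs in word_count.items()}

print(word_count_overall["mat"])     # 2
print(dict(word_count_docs["mat"]))  # {1: 1, 2: 1}
```

Here "mat" is recorded as [1, 2] in word_count, so its overall count is 2 and its per-document Counter has one occurrence each for documents 1 and 2, which matches the example in the question.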