How can I write the name of the text file before the frequency of each word?
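How can I write the text file name alongside each word's frequency, so that the file name is shown first, followed by the frequency of the word in that file? For example: {like: ['file1', 2, 'file2', 4]}. Here "like" is a word that both files contain, and I want file1 and file2 written before their respective frequencies. It should work generically for any number of files.
Here is my code: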
file_list = [open(file, 'r') for file in files]
num_files = len(file_list)
wordFreq = {}
for i, f in enumerate(file_list):
    for line in f:
        for word in line.lower().split():
            if word not in wordFreq:
                wordFreq[word] = [0 for _ in range(num_files)]
            wordFreq[word][i] += 1
I know my code is not very pretty and not exactly what you asked for, but it is a solution. I prefer using dictionaries over a list structure like ['file1', 2, 'file2', 4].
Let's define two files as an example:
file1.txt:
this is an example
file2.txt:
this is an example
but multi line example
The solution is as follows:
from collections import Counter
filenames = ["file1.txt", "file2.txt"]
# First, find word frequencies in files
file_dict = {}
for filename in filenames:
    with open(filename) as f:
        text = f.read()
    words = text.split()
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    file_dict[filename] = dict(cnt)
print("file_dict: ", file_dict)
# Then, calculate frequencies in files for each word
word_dict = {}
for filename, words in file_dict.items():
    for word, count in words.items():
        if word not in word_dict.keys():
            word_dict[word] = {filename: count}
        else:
            if filename not in word_dict[word].keys():
                word_dict[word][filename] = count
            else:
                word_dict[word][filename] += count
print("word_dict: ", word_dict)
Output:
file_dict: {'file1.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 1}, 'file2.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 2, 'but': 1, 'multi': 1, 'line': 1}}
word_dict: {'this': {'file1.txt': 1, 'file2.txt': 1}, 'is': {'file1.txt': 1, 'file2.txt': 1}, 'an': {'file1.txt': 1, 'file2.txt': 1}, 'example': {'file1.txt': 1, 'file2.txt': 2}, 'but': {'file2.txt': 1}, 'multi': {'file2.txt': 1}, 'line': {'file2.txt': 1}}
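If the flat list layout from the question is still wanted, a nested dict shaped like word_dict can be converted in a few lines. This is only a sketch: `to_flat_lists` is a hypothetical helper name, and the sample data is a shortened copy of the output above.

```python
def to_flat_lists(word_dict):
    """Turn {word: {filename: count}} into {word: [filename, count, ...]}."""
    flat = {}
    for word, per_file in word_dict.items():
        entries = []
        for filename, count in per_file.items():
            # Put the file name first, then its frequency
            entries.extend([filename, count])
        flat[word] = entries
    return flat

# Sample data taken from the output above (shortened)
sample = {
    'example': {'file1.txt': 1, 'file2.txt': 2},
    'but': {'file2.txt': 1},
}
flat = to_flat_lists(sample)
print(flat)
# {'example': ['file1.txt', 1, 'file2.txt', 2], 'but': ['file2.txt', 1]}
```

Words that are missing from a file simply get no entry for that file, so the lists can have different lengths.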
This is a good use case for collections.Counter; I suggest making one counter per file.
from collections import Counter
def make_counter(filename):
    cnt = Counter()
    with open(filename) as f:
        for line in f:  # reading line by line performs better for big files
            cnt.update(line.split())  # split the line on whitespace and update the word counts
    print(filename, cnt)
    return cnt
This function can be used for each file, producing a dict that holds all the counters:
filename_list = ['f1.txt', 'f2.txt', 'f3.txt']
counter_dict = {  # this will hold a counter for each file
    fn: make_counter(fn)
    for fn in filename_list}
Now a set can be used to get all the distinct words that appear in the files:
all_words = set(  # this will hold all different words that appear
    word          # in any of the files
    for cnt in counter_dict.values()
    for word in cnt.keys())
These lines print each word along with its count in each file:
for word in sorted(all_words):
    print(word)
    for fn in filename_list:
        print(' {}: {}'.format(fn, counter_dict[fn][word]))
Obviously you can adapt the printing to your specific needs, but this approach should give you the flexibility you need.
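As one possible adaptation, the sketch below prints a single line per word with the file name in front of each count, and leaves out files in which the word does not occur. The helper name `format_word_line` is hypothetical, and the counters are built from in-memory strings here only so the example runs on its own; in practice they would come from make_counter().

```python
from collections import Counter

# Stand-in counters (assumption: normally produced by make_counter per file)
counter_dict = {
    'f1.txt': Counter('this is an example'.split()),
    'f2.txt': Counter('this is an example but multi line example'.split()),
}

def format_word_line(word, counter_dict):
    """Hypothetical helper: build 'word: file count, file count', skipping zero counts."""
    parts = [
        '{} {}'.format(fn, cnt[word])
        for fn, cnt in counter_dict.items()
        if cnt[word] > 0  # a Counter returns 0 for words it has not seen
    ]
    return '{}: {}'.format(word, ', '.join(parts))

all_words = set(word for cnt in counter_dict.values() for word in cnt)
lines = [format_word_line(word, counter_dict) for word in sorted(all_words)]
for line in lines:
    print(line)
```

With this sample data, the line for "example" reads `example: f1.txt 1, f2.txt 2`, while "but" only lists f2.txt.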
If you would rather have a single dict with all the words as keys and their per-file counts as values, you could try something like this:
all_words = {}
for fn, cnt in counter_dict.items():
    for word, n in cnt.items():
        all_words.setdefault(word, {}).setdefault(fn, 0)
        all_words[word][fn] += n  # add the count for this file
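A word-to-file mapping of this kind can also be built with collections.defaultdict, which avoids the chained setdefault calls. This is a sketch rather than part of the original answer; the small in-memory counters stand in for counter_dict so the example is self-contained.

```python
from collections import Counter, defaultdict

# Stand-in counters (assumption: normally one Counter per file)
counter_dict = {
    'f1.txt': Counter(['this', 'is', 'an', 'example']),
    'f2.txt': Counter(['this', 'example', 'example']),
}

# Each new word automatically gets an empty inner dict
all_words = defaultdict(dict)
for fn, cnt in counter_dict.items():
    for word, n in cnt.items():
        all_words[word][fn] = n  # each (word, file) pair occurs only once

all_words = dict(all_words)  # convert back to a plain dict
print(all_words['example'])
# {'f1.txt': 1, 'f2.txt': 2}
```

Because every file contributes exactly one Counter, a plain assignment is enough here; no running sum is needed.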