How can I write the name of the text file before the frequency of each word?

How can I include the text file name with each word frequency, so that the output shows the file name first and then the frequency of the word in that file? For example: { 'like': ['file1', 2, 'file2', 4] }. Here "like" is a word that both files contain, and I want file1 and file2 to appear before their frequencies. It should work for any number of files.

Here is my code

file_list = [open(file, 'r') for file in files]
num_files = len(file_list)
wordFreq = {}
for i, f in enumerate(file_list):
    for line in f:
        for word in line.lower().split():
            if word not in wordFreq:
                wordFreq[word] = [0 for _ in range(num_files)]
            wordFreq[word][i] += 1

I know that my code is not pretty and not exactly what you want, but it is a solution. I would prefer using a dictionary instead of a list structure like ['file1', 2, 'file2', 4].

Let's define 2 files as an example:

file1.txt:

this is an example

file2.txt:

this is an example
but multi line example

Here is the solution:

from collections import Counter

filenames = ["file1.txt", "file2.txt"]

# First, find word frequencies in files
file_dict = {}
for filename in filenames:
    with open(filename) as f:
        text = f.read()
    words = text.split()

    # Counter counts every word in a single pass
    file_dict[filename] = dict(Counter(words))

print("file_dict: ", file_dict)

# Then, collect the per-file frequencies for each word
word_dict = {}
for filename, words in file_dict.items():
    for word, count in words.items():
        # Each (word, filename) pair occurs only once in file_dict,
        # so a plain assignment is enough.
        word_dict.setdefault(word, {})[filename] = count


print("word_dict: ", word_dict)

Output:

file_dict:  {'file1.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 1}, 'file2.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 2, 'but': 1, 'multi': 1, 'line': 1}}
word_dict:  {'this': {'file1.txt': 1, 'file2.txt': 1}, 'is': {'file1.txt': 1, 'file2.txt': 1}, 'an': {'file1.txt': 1, 'file2.txt': 1}, 'example': {'file1.txt': 1, 'file2.txt': 2}, 'but': {'file2.txt': 1}, 'multi': {'file2.txt': 1}, 'line': {'file2.txt': 1}}
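
If the flat list form from the question is still wanted, the nested dictionary can be converted afterwards. This is a minimal sketch, assuming a word_dict shaped like the output above; the helper name flatten_word_dict is hypothetical and not part of the code above.

```python
def flatten_word_dict(word_dict):
    """Turn {word: {filename: count}} into {word: [filename, count, ...]}."""
    flat = {}
    for word, per_file in word_dict.items():
        pairs = []
        for filename, count in per_file.items():
            pairs.extend([filename, count])  # file name first, then its count
        flat[word] = pairs
    return flat


# Sample data copied from part of the output above
sample = {
    "example": {"file1.txt": 1, "file2.txt": 2},
    "but": {"file2.txt": 1},
}
print(flatten_word_dict(sample))
# {'example': ['file1.txt', 1, 'file2.txt', 2], 'but': ['file2.txt', 1]}
```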

This is a good use case for collections.Counter; I suggest making a counter for each file.

from collections import Counter

def make_counter(filename):
    cnt = Counter()

    with open(filename) as f:
        for line in f:                # reading line by line keeps memory use low for big files
            cnt.update(line.split())  # split the line on whitespace and update the word counts

    print(filename, cnt)
    return cnt

This function can be used for each file, making a dict that holds all the counters:

filename_list = ['f1.txt', 'f2.txt', 'f3.txt']
counter_dict = {                      # this will hold a counter for each file
    fn: make_counter(fn)
    for fn in filename_list}

Now a set can be used to get all the different words that appear in the files:

all_words = set(                      # this will hold all different words that appear
    word                              # in any of the files
    for cnt in counter_dict.values()
    for word in cnt.keys())

And these lines print each word and the count that word has in each file:

for word in sorted(all_words):
    print(word)
    for fn in filename_list:
        print('  {}: {}'.format(fn, counter_dict[fn][word]))

You can adjust the printing to your specific needs, but this approach should give you the flexibility you need.
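
The printing loop looks up every word in every file's counter, including files where the word never occurs. That works because a Counter returns 0 for a missing key instead of raising KeyError. A small illustration, using made-up word lists in place of real files:

```python
from collections import Counter

# Hypothetical word lists standing in for the contents of two files
counter_dict = {
    "f1.txt": Counter("this is an example".split()),
    "f2.txt": Counter("but multi line example example".split()),
}

for fn, cnt in counter_dict.items():
    # "multi" is absent from f1.txt; the lookup yields 0 rather than an error
    print(fn, cnt["multi"], cnt["example"])
# f1.txt 0 1
# f2.txt 1 2
```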


If you would rather have one dict with all the words as keys and their per-file counts as values, you could try something like this:

all_words = {}

for fn, cnt in counter_dict.items():
    for word, n in cnt.items():
        # make sure the nested entry exists, then add this file's count
        all_words.setdefault(word, {}).setdefault(fn, 0)
        all_words[word][fn] += n
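
The same inversion can also be written with collections.defaultdict, which removes the need for setdefault. This is only a sketch of an alternative; the function name invert_counters and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def invert_counters(counter_dict):
    """Turn {filename: Counter} into {word: {filename: count}}."""
    inverted = defaultdict(dict)
    for fn, cnt in counter_dict.items():
        for word, n in cnt.items():
            inverted[word][fn] = n  # each (word, file) pair occurs once
    return dict(inverted)


# Hypothetical counters standing in for two real files
sample = {
    "f1.txt": Counter(["like", "like", "tea"]),
    "f2.txt": Counter(["like"]),
}
print(invert_counters(sample))
# {'like': {'f1.txt': 2, 'f2.txt': 1}, 'tea': {'f1.txt': 1}}
```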
