
How to sort a dictionary by relative word frequency in two txt files

I'm trying to write some code that reads two separate text files, filters out common words, calculates the frequency of the words in each file, and finally outputs them in order of relative frequency between the two lists. The ideal output is that words relatively more frequent in file 1 appear at the top of the list, words relatively more frequent in file 2 appear at the bottom, and words that appear with similar frequency in both sit in the middle. For example:

word, freq file 1, freq file 2
Cat, 5, 0
Dog, 4, 0
Mouse, 2, 2
Carrot, 1, 4
Lettuce, 0, 5

My code currently outputs the words in order of their frequency in file 1, but I can't figure out how to arrange the list so that the words more common in file 2 appear at the bottom. I get that I need to subtract a word's frequency in file 1 from its frequency in file 2, but I can't figure out how to operate on the pair of counts stored in the dictionary...

Please help!

import re

f1=open('file1.txt','r', encoding="utf-8") #file 1
f2=open('file2.txt','r', encoding="utf-8") #file 2

file_list = [f1, f2] # This will hold all the files

num_files = len(file_list)

stopwords = ["a", "and", "the", "i", "of", "this", "it", "but", "is", "in", "im", "my", "to", "for", "as", "on", "helpful", "comment", "report", "stars", "reviewed", "united", "kingdom", "was", "with", "-", "it", "not", "about", "which", "so", "at", "out", "abuse", "than","any", "if", "be", "can", "its", "customer", "dont", "just", "other", "too", "only", "people", "found", "helpful", "have", "wasnt", "purchase", "do", "only", "bought", "etc", "verified", "", "wasnt", "thanks", "thanx", "could", "think", "your", "thing", "much", "ive", "you", "they", "vine", "had", "more", "that"]

frequencies = {} # One dictionary to hold the frequencies

for i, f in enumerate(file_list):   # Loop over the files, keeping an index i
    for line in f:                              # Get the lines of that file
        for word in line.split():               # Get the words of that line
            word = re.sub(r'[^\w\s]', '', word) # Strip punctuation
            word = word.lower()                 # Make lowercase
            if word not in stopwords:           # Remove stopwords
                if not word.isdigit():          # Ignore digits
                    if word not in frequencies:
                        frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word not seen yet -- one 0 for each file

                    frequencies[word][i] += 1   # Increment the frequency count for that word and file

frequency_sorted = sorted(frequencies, key=frequencies.get, reverse=True)
for r in frequency_sorted:
    print (r, frequencies[r])

You're overcomplicating things. This should help you:

import string
from collections import Counter

def get_freqs(name):
    with open(name) as fin:
        text = fin.read().lower()

    # Replace every non-letter character with a space, then split on whitespace
    letters_only = ''.join(c if c in string.ascii_letters else ' ' for c in text)
    words = [w for w in letters_only.split() if len(w) > 0]
    return Counter(words)

freqs1 = get_freqs('file1.txt')
freqs2 = get_freqs('file2.txt')

all_words = set(freqs1.keys()) | set(freqs2.keys())  # - set(stop_words) ?
freqs_sorted = sorted((freqs1[w], freqs2[w], w) for w in all_words)

If you are worried about the stopwords, you can change all_words = set(freqs1.keys()) | set(freqs2.keys()) into all_words = set(freqs1.keys()) | set(freqs2.keys()) - set(stop_words) or something similar.
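
To get the ordering asked for in the question (words relatively more frequent in file1.txt at the top, words relatively more frequent in file2.txt at the bottom), one option is to sort by the difference of the two counts. A minimal sketch building on the freqs1 / freqs2 Counters above (a Counter simply returns 0 for a word it has not seen):

by_relative_freq = sorted(all_words, key=lambda w: freqs2[w] - freqs1[w])  # negative -> file 1 heavy, sorts first
for w in by_relative_freq:
    print(w, freqs1[w], freqs2[w])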

It may be best if the key argument to sorted points to a function that simply returns an integer: the difference between a word's frequencies in the two files. The complete solution below incorporates that idea.
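
In essence, the key function boils down to something like this (just a sketch, assuming freq1 and freq2 are plain dicts mapping words to counts and all_words is the union of their keys):

words_by_relative_freq = sorted(all_words, key=lambda w: freq2.get(w, 0) - freq1.get(w, 0))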

import re

class FrequencyComparer:

    stopwords = {"a", "and", "the", "i", "of", "this", "it", "but", "is", "in", "im", "my", "to", "for", "as", "on", "helpful", "comment", "report", "stars", "reviewed", "united", "kingdom", "was", "with", "-", "it", "not", "about", "which", "so", "at", "out", "abuse", "than","any", "if", "be", "can", "its", "customer", "dont", "just", "other", "too", "only", "people", "found", "helpful", "have", "wasnt", "purchase", "do", "only", "bought", "etc", "verified", "", "wasnt", "thanks", "thanx", "could", "think", "your", "thing", "much", "ive", "you", "they", "vine", "had", "more", "that"}

    def __init__(self, file1, file2):
        self.freq1 = self.get_freqs_from_file(file1)
        self.freq2 = self.get_freqs_from_file(file2)

    def get_freqs_from_file(self, filename):
        matcher = re.compile(r"[^\w]*([\w']*)[^\w]*$").match  # strip leading/trailing punctuation from a token
        freqs = {}
        with open(filename) as f:
            for line in f:
                for word in line.split():
                    m = matcher(word.lower())
                    if m:
                        w = m.group(1)
                        if w not in self.stopwords:
                            freqs[w] = freqs.get(w, 0) + 1
        return freqs

    def get_freqs_for_word(self, word):
        return (self.freq1.get(word, 0), self.freq2.get(word, 0))

    def get_relative_freq(self, word):
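        # Negative means the word is more frequent in file 1 (sorts towards the top),
        # positive means it is more frequent in file 2 (sorts towards the bottom).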
        freqs = self.get_freqs_for_word(word)
        return freqs[1] - freqs[0]

    def get_all_words(self):
        return set(self.freq1) | set(self.freq2)

    def get_all_words_by_relative_freq(self):
        all_words = self.get_all_words()
        return sorted(all_words, key=self.get_relative_freq)


fc = FrequencyComparer("file1.txt", "file2.txt")

for word in fc.get_all_words_by_relative_freq():
    print(word, fc.get_freqs_for_word(word))
