Loop through a dictionary and get the 7 most common words, but only if the words aren't found in another list

I am learning some basic Python 3 and have been stuck on this problem for two days now, and I can't seem to get anywhere.
I have been reading the "Think Python" book and am working on chapter 13 and the case study it contains. The chapter is all about reading a file and doing some magic with it, like counting the total number of words and finding the most used words.
One part of the program is about "dictionary subtraction", where the program fetches all the words from one text file that are not found in another text file.

What I also need the program to do is find the most common words from the first file, excluding the words found in the "dictionary" text file. This functionality has had me stuck for two days and I don't really know how to solve it.

The code for my program is as follows:

import string

def process_file(filename):
    hist = {}
    fp = open(filename)

    for line in fp:
        process_line(line, hist)
    return hist


def process_line(line, hist):
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()

        hist[word] = hist.get(word, 0) + 1


def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))

    t.sort()
    t.reverse()
    return t


def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = None
    return res


hist = process_file('alice-ch1.txt')
words = process_file('common-words.txt')
diff = subtract(hist, words)


def total_words(hist):
    return sum(hist.values())


def different_words(hist):
    return len(hist)


if __name__ == '__main__':

    print('Total number of words:', total_words(hist))
    print('Number of different words:', different_words(hist))

    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[0:7]:
        print(word, '\t', freq)
    print("The words in the book that aren't in the word list are:")
    for word in diff.keys():
        print(word)

I then created a test dict containing a few words with made-up counts, plus a test list, to try to solve my problem. The code for that is:

histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish':250, 'chicken':55, 'cow':10, 'bye':20, 'the':93, 'she':79, 'to':75}
listfake =['removeme', 'fish']

newdict = {}
for key, val in histfake.items():
    for commonword in listfake:
        if key != commonword:
            newdict[key] = val
        else:
            newdict[key] = 0

sortcommongone = []
for key, value in newdict.items():
    sortcommongone.append((value, key))
sortcommongone.sort()
sortcommongone.reverse()

for freq, word in sortcommongone:
    print(word, '\t', freq)

The problem is that this code only works for one word. Only one matched word between the dict and the list gets the value 0. (I thought that I could give the duplicate words the value 0, since I only need the 7 most common words that are not found in the common-words text file.)
How can I solve this? I created an account here just to try to get some help with this, since Stack Overflow has helped me before with other problems, but this time I needed to ask the question myself. Thanks!

You can filter out the items using a dict comprehension:

>>> {key: value for key, value in histfake.items() if key not in listfake}
{'hi': 3, 'she': 79, 'to': 75, 'cow': 10, 'bye': 20, 'chicken': 55, 'the': 93, 'hello': 12}
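As a side note on why the loop in the question zeroes only one word: the inner loop assigns newdict[key] once per entry in listfake, so only the comparison against the last entry survives. A minimal sketch of the filter-then-sort approach, reusing the question's test data and converting the list to a set (the names excluded, filtered and top7 are made up for this example):

```python
histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55,
            'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

# A set makes each membership test cheap, which matters for a long word list.
excluded = set(listfake)

# Keep only the words that are not in the exclusion set.
filtered = {word: count for word, count in histfake.items()
            if word not in excluded}

# Sort by count, highest first, and keep the first seven entries.
top7 = sorted(filtered.items(), key=lambda item: item[1], reverse=True)[:7]

for word, count in top7:
    print(word, '\t', count)
```

With this data the first line printed is "the" with 93, and "fish" no longer appears even though it has the highest raw count.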

Unless listfake is larger than histfake, the most efficient way will be to delete the keys listed in listfake from histfake:

for key in listfake:
    # pop() with a default avoids a KeyError when a word is not in the dict
    histfake.pop(key, None)

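The dict comprehension has to visit every key in the dictionary (and each "key not in listfake" test scans the list), whereas this loop does one dictionary operation per entry in listfake, and the list is supposedly much shorter than the dictionary.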

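EDIT: Or, if listfake has more entries than histfake has keys, loop over the dictionary instead. Iterate over a copy of the keys, because deleting from a dict while iterating over it raises a RuntimeError:

for key in list(histfake):
    if key in listfake:
        del histfake[key]

You may want to test the run time of both.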
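One way to compare run times is the standard timeit module. This is only a rough sketch on synthetic data: the dictionary size, the word list, and the function names by_comprehension and by_deletion are assumptions made up for the example, and the numbers will differ on real text.

```python
import timeit

# Synthetic data: 10,000 distinct "words" and a list of 100 words to exclude.
hist = {f'word{i}': i for i in range(10_000)}
common = [f'word{i}' for i in range(0, 10_000, 100)]


def by_comprehension():
    # Build a new dict, skipping the excluded words.
    excluded = set(common)
    return {k: v for k, v in hist.items() if k not in excluded}


def by_deletion():
    # Copy first so every timed run starts from the full dictionary.
    result = dict(hist)
    for key in common:
        result.pop(key, None)
    return result


# Both approaches should produce the same filtered dictionary.
assert by_comprehension() == by_deletion()

print('comprehension:', timeit.timeit(by_comprehension, number=100))
print('deletion:     ', timeit.timeit(by_deletion, number=100))
```

Note that the deletion variant here pays for a copy of the dictionary on every run; deleting in place, as in the answer, skips that cost but destroys the original counts.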

Then, of course, you'll have to sort the dictionary's items into a list and rebuild a dictionary from the top seven:

from operator import itemgetter
# reverse=True sorts by count from highest to lowest, so [:7] keeps the most common
most_common_7 = dict(sorted(histfake.items(), key=itemgetter(1), reverse=True)[:7])
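If only the top few entries are needed, heapq.nlargest from the standard library is an alternative to sorting the whole dictionary. A small sketch, assuming the excluded words have already been removed (the hist dictionary below is the question's test data without 'removeme' and 'fish'):

```python
import heapq
from operator import itemgetter

hist = {'hello': 12, 'hi': 3, 'chicken': 55, 'cow': 10,
        'bye': 20, 'the': 93, 'she': 79, 'to': 75}

# nlargest returns the 7 items with the highest counts, highest first.
top7 = heapq.nlargest(7, hist.items(), key=itemgetter(1))

for word, count in top7:
    print(word, '\t', count)
```

The result is a list of (word, count) tuples, which is usually all that is needed for printing; wrap it in dict() if a dictionary is required.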

BTW, you may use Counter from the collections module to count words. And maybe part of your problem is that you don't remove all non-letter characters from your text.
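A sketch of what the Counter version might look like. The helper count_words and the sample lines are invented for this example; the cleaning steps mirror process_line from the question, with an extra check that skips tokens that become empty after stripping punctuation.

```python
import string
from collections import Counter


def count_words(lines, excluded=()):
    """Count the words in an iterable of lines, skipping excluded words."""
    excluded = set(excluded)
    counts = Counter()
    for line in lines:
        for word in line.replace('-', ' ').split():
            word = word.strip(string.punctuation + string.whitespace).lower()
            # Skip empty tokens and anything on the exclusion list.
            if word and word not in excluded:
                counts[word] += 1
    return counts


sample = ["The cat -- the CAT! -- sat.", "A cat sat on the mat."]
counts = count_words(sample, excluded=['the', 'a', 'on'])

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counts.most_common(2))  # [('cat', 3), ('sat', 2)]
```

For the real program, an open file object could be passed as lines, and counts.most_common(7) would give the seven most common words that are not on the exclusion list.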
