
How to find the most common entry in a dictionary of dictionaries in Python

I have a dictionary that's two levels deep. That is, each key in the first dictionary is a url and the value is another dictionary, whose keys are words and whose values are the number of times the word appeared on that url. It looks something like this:

dic = {
    'http://www.cs.rpi.edu/news/seminars.html': {
        'hyper': 1,
        'summer': 2,
        'expert': 1,
        'koushk': 1,
        'semantic': 1,
        'feedback': 1,
        'sandia': 1,
        'lewis': 1,
        'global': 1,
        'yener': 1,
        'laura': 1,
        'troy': 1,
        'session': 1,
        'greenhouse': 1,
        'human': 1

...and so on...

The dictionary itself is very long, with 25 urls in it, each url mapping to another dictionary of every word found on that url and the number of times it is found.

I want to find the word or words that appear in the most different urls in the dictionary. So the output should look something like this:

The following words appear x times on y pages: list of words

It seems that you should use a Counter for this:

from collections import Counter
print(sum((Counter(x) for x in dic.values()), Counter()).most_common())

Or the multiline version:

c = Counter()
for d in dic.values():
    c += Counter(d)

print(c.most_common())
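For instance, with a small made-up dictionary in the same shape as the question's data, the merged counter comes out as follows (the urls and counts here are invented for illustration):

```python
from collections import Counter

# Toy data in the same shape as the question's dictionary
dic = {
    'http://example.com/a': {'summer': 2, 'expert': 1},
    'http://example.com/b': {'summer': 1, 'global': 2},
}

# Merge the per-page word counts into one Counter
c = Counter()
for d in dic.values():
    c += Counter(d)

print(c.most_common())  # [('summer', 3), ('global', 2), ('expert', 1)]
```

Note that `most_common()` sorts by total occurrences across all pages, not by the number of pages a word appears on.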

To get the words which are common to all of the subdicts:

subdicts = iter(dic.values())
s = set(next(subdicts)).intersection(*subdicts)

Now you can use that set to filter the resulting counter, removing words which don't appear in every subdict:

c = Counter({k: v for k, v in c.items() if k in s})
print(c.most_common())
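Putting the pieces together on toy data (made-up words and counts), only the words present on every page survive the filter:

```python
from collections import Counter

# Toy data: 'summer' and 'human' appear on both pages, 'expert' on only one
dic = {
    'page1': {'summer': 2, 'expert': 1, 'human': 1},
    'page2': {'summer': 1, 'human': 4},
}

# Total occurrences of each word across all pages
c = Counter()
for d in dic.values():
    c += Counter(d)

# Words present in every subdict
subdicts = iter(dic.values())
s = set(next(subdicts)).intersection(*subdicts)

# Keep only the words that appear on every page
c = Counter({k: v for k, v in c.items() if k in s})
print(c.most_common())  # [('human', 5), ('summer', 3)]
```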

A Counter isn't quite what you want. From the output you show, it looks like you want to keep track of both the total number of occurrences, and the number of pages the word occurs on.

data = {
    'page1': {
        'word1': 5,
        'word2': 10,
        'word3': 2,
    },
    'page2': {
        'word2': 2,
        'word3': 1,
    }
}

from collections import defaultdict
class Entry(object):
    def __init__(self):
        self.pages = 0
        self.occurrences = 0
    def __iadd__(self, occurrences):
        self.pages += 1
        self.occurrences += occurrences
        return self
    def __str__(self):
        return '{} occurrences on {} pages'.format(self.occurrences, self.pages)
    def __repr__(self):
        return '<Entry {} occurrences, {} pages>'.format(self.occurrences, self.pages)

counts = defaultdict(Entry)

for page_words in data.values():
    for word, count in page_words.items():
        counts[word] += count

for word, entry in counts.items():
    print(word, ':', entry)

This produces the following output:

word1 : 5 occurrences on 1 pages
word2 : 12 occurrences on 2 pages
word3 : 3 occurrences on 2 pages

That captures the information you want; the next step is to find the most common n words. You can do that with a heap-based selection, heapq.nlargest (which has the handy feature of not requiring you to sort the whole list of words by number of pages and then occurrences - that might be important if you've got a lot of words in total, but the n of 'top n' is relatively small).

from heapq import nlargest
def by_pages_then_occurrences(item):
    entry = item[1]
    return entry.pages, entry.occurrences
print(nlargest(2, counts.items(), key=by_pages_then_occurrences))
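As a lighter-weight alternative to the Entry class (a sketch, not part of the original answer), the same page/occurrence bookkeeping can be done with two plain Counters, which also yields the question's desired output line:

```python
from collections import Counter

# Same toy data as above
data = {
    'page1': {'word1': 5, 'word2': 10, 'word3': 2},
    'page2': {'word2': 2, 'word3': 1},
}

pages = Counter()        # number of pages each word appears on
occurrences = Counter()  # total occurrences across all pages

for page_words in data.values():
    pages.update(page_words.keys())   # each word counts once per page
    occurrences.update(page_words)    # add the per-page counts

# Words that appear on the largest number of pages
best = max(pages.values())
words = [w for w, n in pages.items() if n == best]
print('The following words appear on {} pages: {}'.format(best, words))
```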
