How to get the 10 most frequent strings in a list in Python

I have a list that has 93 different strings. I need to find the 10 most frequent strings, and the result must be ordered from most frequent to least frequent.

mylist = ['"and', '"beware', '`twas', 'all', 'all', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'arms', 'as', 'as', 'awhile', 'back', 'bandersnatch', 'beamish', 'beware', 'bird', 'bite', 'blade', 'borogoves', 'borogoves', 'boy', 'brillig']
 # this is just a sample of the actual list.

I don't have the newest version of Python and cannot use Counter.

You could use a Counter from the collections module to do this.

from collections import Counter
c = Counter(mylist)

Then c.most_common(10) returns

[('and', 13),
 ('all', 2),
 ('as', 2),
 ('borogoves', 2),
 ('boy', 1),
 ('blade', 1),
 ('bandersnatch', 1),
 ('beware', 1),
 ('bite', 1),
 ('arms', 1)]
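
Since the question asks for the strings themselves rather than (word, count) pairs, the counts can be dropped afterwards. A minimal sketch, assuming Counter is available (Python 2.7 or later); the function name ten_most_frequent and the sample data are made up for illustration. How tied words are ordered depends on the Python version, so do not rely on it.

```python
from collections import Counter

def ten_most_frequent(words, n=10):
    # most_common(n) returns (word, count) pairs, highest count first;
    # keep only the word from each pair
    return [word for word, count in Counter(words).most_common(n)]

sample = ['b', 'a', 'b', 'c', 'b', 'a']
print(ten_most_frequent(sample, 2))
```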

David's answer is the best, but if you are using a version of Python that does not include Counter in the collections module (it was introduced in Python 2.7), you can use this implementation of a counter class that does the same thing. I suspect that it would be slower than the module, but it will produce the same result.
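
The implementation linked in the original answer is not reproduced here. As a rough stand-in, a minimal counter class might look like the sketch below. SimpleCounter is a hypothetical name, and the class covers only counting and most_common, not the rest of the real Counter API.

```python
class SimpleCounter(dict):
    """Hypothetical, stripped-down stand-in for collections.Counter."""

    def __init__(self, iterable=()):
        dict.__init__(self)
        for item in iterable:
            # get() returns 0 the first time an item is seen
            self[item] = self.get(item, 0) + 1

    def most_common(self, n=None):
        # Sort (item, count) pairs by count, highest first
        ranked = sorted(self.items(), key=lambda pair: pair[1], reverse=True)
        if n is None:
            return ranked
        return ranked[:n]

c = SimpleCounter(['a', 'b', 'a'])
print(c.most_common(1))
```

It uses only built-ins, so it should also run on Python versions that predate Counter.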

David's solution is the best.

But, probably more for fun than anything, here is a solution that does not import any module:

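dicto = {}

# Count occurrences by hand: increment if seen, otherwise start at 1
for ele in mylist:
    try:
        dicto[ele] += 1
    except KeyError:
        dicto[ele] = 1

# Sort (word, count) pairs by count, highest first, and keep the first 10
# (items() works on both Python 2 and 3; iteritems() is Python 2 only)
top_10 = sorted(dicto.items(), key = lambda k: k[1], reverse = True)[:10]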

Result:

>>> top_10
[('and', 13), ('all', 2), ('as', 2), ('borogoves', 2), ('boy', 1), ('blade', 1), ('bandersnatch', 1), ('beware', 1), ('bite', 1), ('arms', 1)]

EDIT:

Answering the follow-up question:

new_dicto = {}

# Group words by frequency: {count: [words with that count]}
for key, val in dicto.items():
    try:
        new_dicto[val].append(key)
    except KeyError:
        new_dicto[val] = [key]

# Highest count first; words within each count sorted alphabetically
alph_sorted = sorted([(count, sorted(words)) for count, words in new_dicto.items()], reverse = True)

Result:

>>> alph_sorted
[(13, ['and']), (2, ['all', 'as', 'borogoves']), (1, ['"and', '"beware', '`twas', 'arms', 'awhile', 'back', 'bandersnatch', 'beamish', 'beware', 'bird', 'bite', 'blade', 'boy', 'brillig'])]

The words that show up once are sorted alphabetically. Note that some words have extra quotation marks in them, which is why they sort ahead of the plain words.

EDIT:

Answering another follow-up question:

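top_10 = []

for tup in alph_sorted:
    for word in tup[1]:
        top_10.append(word)
        if len(top_10) == 10:
            break
    # the inner break only leaves the inner loop, so stop the outer loop too
    if len(top_10) == 10:
        break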

Result:

>>> top_10
['and', 'all', 'as', 'borogoves', '"and', '"beware', '`twas', 'arms', 'awhile', 'back']
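
The three steps above (count, group by frequency, flatten to ten words) can also be collapsed into a single sort. This is a sketch of an alternative, not the code from the answer: it sorts on the pair (-count, word), so higher counts come first and ties fall back to alphabetical order. The name top_words_sorted and the sample data are made up for illustration.

```python
def top_words_sorted(words, n=10):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    # Negating the count sorts highest first; the word itself breaks ties
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, count in ranked[:n]]

sample = ['all', 'all', 'and', 'and', 'and', 'as', 'as', 'boy', 'arms']
print(top_words_sorted(sample, 3))
```

Because the tie-break is part of the sort key, the result does not depend on dictionary iteration order.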

Without using Counter, as the modified version of the question requests.

Changed to use heapq.nlargest, as suggested by @Duncan.

>>> from collections import defaultdict
>>> from operator import itemgetter
>>> from heapq import nlargest
>>> mylist = ['"and', '"beware', '`twas', 'all', 'all', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'arms', 'as', 'as', 'awhile', 'back', 'bandersnatch', 'beamish', 'beware', 'bird', 'bite', 'blade', 'borogoves', 'borogoves', 'boy', 'brillig']
>>> c = defaultdict(int)
>>> for item in mylist:
...     c[item] += 1
...
>>> [word for word, freq in nlargest(10, c.items(), key=itemgetter(1))]
['and', 'all', 'as', 'borogoves', 'boy', 'blade', 'bandersnatch', 'beware', 'bite', 'arms']

If your Python version does not support Counter, you can do what Counter itself does internally:

>>> import operator,collections,heapq
>>> counter = collections.defaultdict(int)
>>> for elem in mylist:
...     counter[elem] += 1
...
>>> heapq.nlargest(10, counter.items(), key=operator.itemgetter(1))
[('and', 13), ('all', 2), ('as', 2), ('borogoves', 2), ('boy', 1), ('blade', 1), ('bandersnatch', 1), ('beware', 1), ('bite', 1), ('arms', 1)]

If you look at the Counter class, it builds a dictionary of the occurrence counts of all the elements in the iterable. most_common(n) then passes the dictionary's items to heapq.nlargest, using the count as the key, and retrieves the n largest.
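
Wrapped up as a function, that description might look like the sketch below. It is an approximation for illustration, not a copy of the standard library source; most_common_sketch is a hypothetical name, and the counts are kept in a plain dict so nothing beyond heapq and operator is needed.

```python
import heapq
from operator import itemgetter

def most_common_sketch(iterable, n=None):
    # Step 1: build a dictionary of occurrence counts
    counts = {}
    for elem in iterable:
        counts[elem] = counts.get(elem, 0) + 1
    # Step 2: with no limit, sort everything by count, highest first
    if n is None:
        return sorted(counts.items(), key=itemgetter(1), reverse=True)
    # Step 3: otherwise let heapq pick the n pairs with the largest counts
    return heapq.nlargest(n, counts.items(), key=itemgetter(1))

print(most_common_sketch(['x', 'y', 'x', 'z', 'x', 'y'], 2))
```

Using heapq.nlargest avoids sorting the whole dictionary when only a few top entries are wanted.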
