
Using collections.Counter on large pickle data

I have a pickle file with over a million words in it.我有一个包含超过一百万字的泡菜文件。 The pickle file can be downloaded from here .可以从这里下载 pickle 文件。

I want to use Counter on these words to sort them. Here's my code:

import pickle
from collections import Counter

with open('data/words.pkl', 'rb') as f:
    data = list(pickle.load(f))

print(Counter(data).most_common(3))

The printed result changes every time, but it's usually something like this:

[('', 1), ('fraksiyonal', 1), ('editado', 1)]

So, it seems not to be counting the words, and every word's occurrence is 1. What am I doing wrong?

Edit: As an example of how the data list looks:

print(data[0:10])

Result:

['', 'hillview', 'dipnota', 'дол', 'censusi', 'quathie', 'kalacağının', 'stralauerstrasse', 'sbaglio', 'keny']

The problem is with your data. In a comment you said:

I changed it to a list because the pickle load data is a set object

Sets can't contain duplicates, which is why the counts are always 1.
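
As a minimal sketch (using made-up words rather than the pickle file from the question), this is why counting a set can only ever produce counts of 1, and why the arbitrary iteration order of a set makes the printed result change between runs:

from collections import Counter

# Hypothetical data for illustration, not the asker's actual word list.
words = ['apple', 'banana', 'apple', 'apple', 'banana', 'cherry']

# Counting the list preserves duplicates:
print(Counter(words).most_common(3))
# [('apple', 3), ('banana', 2), ('cherry', 1)]

# Counting a set does not: every element appears exactly once,
# and the order is arbitrary, so the output varies between runs.
print(Counter(set(words)).most_common(3))
# e.g. [('cherry', 1), ('apple', 1), ('banana', 1)]

So if real frequency counts are needed, the duplicates have to be preserved upstream, before the words ever end up in a set.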


Due credit to jasonharper for posting the comment that figured it out.

