
Accessing elements of a counter containing ngrams

I am taking a string, tokenizing it, and want to look at the most common bigrams. Here is what I have so far:

import nltk
import collections
from nltk import ngrams

someString="this is some text. this is some more test. this is even more text."
tokens=nltk.word_tokenize(someString)
# Lowercase each token and drop single-character tokens such as punctuation.
tokens=[token.lower() for token in tokens if len(token)>1]

bigram=ngrams(tokens,2)
aCounter=collections.Counter(bigram)

If I run:

print(aCounter)

Then it outputs the bigrams with their counts, sorted from most to least common.

for element in aCounter:
     print(element)

This prints the elements, but without their counts and not in order of count. I want to write a for loop that prints the X most common bigrams in a text.

I am essentially trying to learn both Python and nltk at the same time, which may be why I am struggling here (I assume this is a trivial thing).

You're probably looking for something that already exists, namely the most_common method on counters. From the docs:

Return a list of the n most common elements and their counts from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily.

You can call it with a value n to get the n most common value-count pairs. For example:

from collections import Counter

# initialize with silly value.
c = Counter('aabbbccccdddeeeeefffffffghhhhiiiiiii')

# Print 4 most common values and their respective count.
for val, count in c.most_common(4):
    print("Value {0} -> Count {1}".format(val, count))

Which prints out:

Value f -> Count 7
Value i -> Count 7
Value e -> Count 5
Value h -> Count 4
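
The same call works on a counter of bigrams, where each key is a tuple of two tokens. Below is a minimal sketch using the sentence from the question. To keep it runnable without nltk, it assumes a regular expression as a stand-in for nltk.word_tokenize and zip as a stand-in for nltk.ngrams; the names bigram_counts and top_x are illustrative and not from the original post.

```python
import re
from collections import Counter

some_string = "this is some text. this is some more test. this is even more text."

# Stand-in for nltk.word_tokenize (an assumption): keep runs of letters only.
tokens = re.findall(r"[a-z]+", some_string.lower())
tokens = [token for token in tokens if len(token) > 1]

# Stand-in for nltk.ngrams(tokens, 2): pair each token with its successor.
bigram_counts = Counter(zip(tokens, tokens[1:]))

top_x = 2  # hypothetical value for "X most common"
for bigram, count in bigram_counts.most_common(top_x):
    print("Bigram {0} -> Count {1}".format(" ".join(bigram), count))
```

Each item returned by most_common is a (key, count) pair, so the loop unpacks the bigram tuple and its count directly.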
