
How to find the most common words using spacy?

I'm using spacy with Python and it's working fine for tagging each word, but I was wondering if it's possible to find the most common words in a string. Also, is it possible to get the most common nouns, verbs, adverbs and so on?

There's a count_by function included, but I can't seem to get it to run in any meaningful way.

I recently had to count the frequency of all the tokens in a text file. You can filter out words to get the POS tokens you like using the pos_ attribute. Here is a simple example:

import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut was removed in spaCy v3+
doc = nlp(u'Your text here')
# all tokens that aren't stop words or punctuation
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
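
For example, to print the results (the actual words and counts depend on the text you pass in):

# print the five most common tokens and their counts
for word, freq in common_words:
    print(word, freq)

# print the five most common nouns and their counts
for noun, freq in common_nouns:
    print(noun, freq)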

This should look basically the same as counting anything else in Python. spaCy lets you just iterate over the document, and you get back a sequence of Token objects. These can be used to access the annotations.

from __future__ import print_function, unicode_literals
import spacy
from collections import defaultdict, Counter

nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut was removed in spaCy v3+

# one Counter of token IDs per part-of-speech ID
pos_counts = defaultdict(Counter)
doc = nlp(u'My text here.')

for token in doc:
    # count by the integer IDs; convert back to strings only when printing
    pos_counts[token.pos][token.orth] += 1

for pos_id, counts in sorted(pos_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id])

Note that the .orth and .pos attributes are integers. You can get the strings they map to via the .orth_ and .pos_ attributes. The .orth attribute is the unnormalised view of the token; there are also .lower, .lemma, etc. string views. You might want to bind a .norm function to do your own string normalisation. See the docs for details.
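
A quick illustration of the integer/string pairing (a minimal sketch; the model name and example sentence are placeholders, and the exact IDs vary between models):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apples are tasty.')
token = doc[0]

print(token.orth, token.orth_)     # integer ID and the raw text 'Apples'
print(token.pos, token.pos_)       # integer ID and the coarse POS tag
print(token.lower_, token.lemma_)  # lowercased view 'apples' and the lemma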

The integers are useful for your counts because you can make your counting program much more memory efficient if you're counting over a large corpus. You could also store the frequency counts in a numpy array for additional speed and efficiency. If you don't want to bother with this, feel free to count with the .orth_ attribute directly, or use its alias .text.
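
A minimal sketch of that idea, counting the integer ORTH IDs and converting only the most frequent ones back to strings at the end (the model name and input text are placeholders):

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Your large text here')

# count the integer ORTH IDs rather than the strings themselves
freq = Counter(token.orth for token in doc)

# convert only the top results back to strings for display
for orth_id, count in freq.most_common(10):
    print(doc.vocab.strings[orth_id], count)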

Note that the .pos attribute in the snippet above gives a coarse-grained set of part-of-speech tags. The richer treebank tags are available on the .tag attribute.
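
For example, printing both tag sets side by side (the sentence is just a placeholder):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'She was reading quietly.')

for token in doc:
    # pos_ is the coarse universal tag, tag_ the fine-grained treebank tag
    print(token.text, token.pos_, token.tag_)  # e.g. reading VERB VBG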

I'm adding to this thread quite late. However, there is in fact a built-in way to do this, using the doc.count_by() function in spacy.

import spacy
import spacy.attrs
nlp = spacy.load("en_core_web_sm")
doc = nlp("It all happened between November 2007 and November 2008")

# Returns integers that map to parts of speech
counts_dict = doc.count_by(spacy.attrs.IDS['POS'])

# Print the human readable part of speech tags
for pos, count in counts_dict.items():
    human_readable_tag = doc.vocab[pos].text
    print(human_readable_tag, count)

The output is:

VERB 1
ADP 1
CCONJ 1
DET 1
NUM 2
PRON 1
PROPN 2
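
count_by() works with other attributes too, so the same approach can answer the original question about the most common words; a minimal sketch using the LOWER attribute (ORTH would preserve the original casing):

import spacy
from spacy.attrs import LOWER

nlp = spacy.load("en_core_web_sm")
doc = nlp("It all happened between November 2007 and November 2008")

# map lowercased-word IDs to their frequencies
counts = doc.count_by(LOWER)

# sort by frequency and convert the IDs back to readable strings
for word_id, count in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print(doc.vocab.strings[word_id], count)  # e.g. november 2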
