
How to find the most common words using spacy?

I'm using spaCy with Python and it's working fine for tagging each word, but I was wondering whether it's possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs and so on?

There's a count_by function included, but I can't seem to get it to run in any meaningful way.

I recently had to count the frequency of all the tokens in a text file. You can filter words by part of speech using the pos_ attribute. Here is a simple example:

import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Your text here')
# all tokens that aren't stop words or punctuation
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
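
Note that most_common returns a list of (token, count) pairs, most frequent first; the exact output naturally depends on your input text:

print(common_words)
# e.g. [('text', 2), ('spacy', 1)] -- (token, count) pairs, highest count first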

This should look basically the same as counting anything else in Python. spaCy lets you just iterate over the document, and you get back a sequence of Token objects. These can be used to access the annotations.

from __future__ import print_function, unicode_literals
import spacy
from collections import defaultdict, Counter

nlp = spacy.load('en_core_web_sm')

pos_counts = defaultdict(Counter)
doc = nlp(u'My text here.')

# Count each token's integer orth ID, bucketed by its integer POS ID
for token in doc:
    pos_counts[token.pos][token.orth] += 1

# Map the integer IDs back to strings only when printing
for pos_id, counts in sorted(pos_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id])

Note that the .orth and .pos attributes are integers. You can get the strings they map to via the .orth_ and .pos_ attributes. The .orth attribute is the unnormalised view of the token; there are also .lower, .lemma etc. string views. You might want to bind a .norm function to do your own string normalisation. See the docs for details.
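
A quick way to see the integer/string pairing (a minimal sketch, assuming the en_core_web_sm model and an arbitrary example sentence):

import spacy

nlp = spacy.load('en_core_web_sm')
token = nlp('Apples are tasty.')[0]

print(token.orth, token.orth_)     # integer hash ID vs. the raw text 'Apples'
print(token.pos, token.pos_)       # integer POS ID vs. the string tag
print(token.lower_, token.lemma_)  # lowercased and lemmatised string views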

The integers are useful for your counts because they make your counting program much more memory efficient if you're counting over a large corpus. You could also store the frequency counts in a numpy array for additional speed and efficiency. If you don't want to bother with this, feel free to count with the .orth_ attribute directly, or use its alias .text.
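
A minimal sketch of counting integer IDs and converting them back to strings only when reporting (assumes en_core_web_sm; the input sentence is just an illustration):

import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp('One fish, two fish, red fish, blue fish.')

# Keys are integer hash IDs, which are cheaper to store than strings
freq = Counter(token.orth for token in doc if token.is_alpha)

# Convert back to strings only for display
for orth_id, count in freq.most_common(3):
    print(doc.vocab.strings[orth_id], count)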

Note that the .pos attribute in the snippet above gives a coarse-grained set of part-of-speech tags. The richer treebank tags are available on the .tag attribute.
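
For example, a token with the coarse tag VERB may carry the fine-grained treebank tag VBG (a small sketch, assuming en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('She was reading quietly.')

for token in doc:
    # pos_ is the coarse universal tag; tag_ is the fine-grained treebank tag
    print(token.text, token.pos_, token.tag_)
# 'reading' comes out as VERB (coarse) vs. VBG (fine-grained)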

I'm adding to this thread quite late, but there is, in fact, a built-in way to do this: the doc.count_by() function in spaCy.

import spacy
import spacy.attrs
nlp = spacy.load("en_core_web_sm")
doc = nlp("It all happened between November 2007 and November 2008")

# Returns a dict mapping part-of-speech integer IDs to their counts
counts_dict = doc.count_by(spacy.attrs.IDS['POS'])

# Print the human readable part of speech tags
for pos, count in counts_dict.items():
    human_readable_tag = doc.vocab[pos].text
    print(human_readable_tag, count)

The output is:

VERB 1
ADP 1
CCONJ 1
DET 1
NUM 2
PRON 1
PROPN 2
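
To tie this back to the original question, count_by also accepts other attributes such as ORTH, which gives raw word frequencies. This is a sketch of that idea, not part of the answer above:

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')
doc = nlp('It all happened between November 2007 and November 2008')

# Keys are integer orth IDs; map them back to strings for display
word_counts = doc.count_by(ORTH)
for orth_id, count in sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(doc.vocab.strings[orth_id], count)
# 'November' appears twice; every other token once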
