
Counting distinct words in a speech using tagset in nltk

I am currently having trouble with this.

I was given a task to implement a function that returns a sorted list of counts of distinct words with a given part of speech. I am required to use NLTK's pos_tag_sents and NLTK's tokeniser to count the specific words.

I had a similar question before and got it working thanks to help from other Stack Overflow users, so I am trying to use the same method to solve this problem.

Here is what I have so far:

import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

def pos_counts(text, pos_list):
    """Return the sorted list of distinct words with a given part of speech
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> pos_counts(emma, ['DET', 'NOUN'])
    [14352, 32029] - expected result
    """

    text = nltk.word_tokenize(text)
    tempword = nltk.pos_tag_sents(text, tagset="universal")
    counts = nltk.FreqDist(tempword)

    return [counts[x] or 0 for x in pos_list]

There is a doctest that should give the result [14352, 32029].

I ran my code and got this error message:

Error
**********************************************************************
File "C:/Users/PycharmProjects/a1/a1.py", line 29, in a1.pos_counts
Failed example:
    pos_counts(emma, ['DET', 'NOUN'])
Exception raised:
    Traceback (most recent call last):
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.4\helpers\pycharm\docrunner.py", line 140, in __run
        compileflags, 1), test.globs)
      File "<doctest a1.pos_counts[1]>", line 1, in <module>
        pos_counts(emma, ['DET', 'NOUN'])
      File "C:/Users/PycharmProjects/a1/a1.py", line 35, in pos_counts
        counts = nltk.FreqDist(tempword)
      File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 108, in __init__
        Counter.__init__(self, samples)
      File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 535, in __init__
        self.update(*args, **kwds)
      File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 146, in update
        super(FreqDist, self).update(*args, **kwargs)
      File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 622, in update
        _count_elements(self, iterable)
    TypeError: unhashable type: 'list'

I feel like I'm getting close, but I don't know what I'm doing wrong.

Any help would be very much appreciated. Thank you.

One way to do it would be like this:

import nltk

def pos_count(text, pos_list):
    sents = nltk.tokenize.sent_tokenize(text)
    words = (nltk.word_tokenize(sent) for sent in sents)
    tagged = nltk.pos_tag_sents(words, tagset='universal')
    tags = [tag[1] for sent in tagged for tag in sent]
    counts = nltk.FreqDist(tag for tag in tags if tag in pos_list)
    return counts

It's all very well explained in the NLTK book. Test:

In [3]: emma = nltk.corpus.gutenberg.raw('austen-emma.txt')

In [4]: pos_count(emma, ['DET', 'NOUN'])
Out[4]: FreqDist({'DET': 14352, 'NOUN': 32029})
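As for the TypeError in the question: pos_tag_sents returns a list of tagged sentences, i.e. a list of lists of (word, tag) tuples, and FreqDist (which, as the traceback shows, subclasses collections.Counter) tries to hash each element it counts. A list is not hashable. A plain Counter reproduces this without any corpus (the tiny tagged list below is made up for illustration):

```python
from collections import Counter

# pos_tag_sents yields a list of sentences, each a list of (word, tag)
# tuples.  Counting that outer list means hashing each inner list, which
# fails with exactly the error from the question:
tagged = [[('The', 'DET'), ('cat', 'NOUN')]]
try:
    Counter(tagged)
except TypeError as e:
    print(e)  # unhashable type: 'list'

# Flattening to the tags first makes every element a hashable string,
# which is what the list comprehension in the answer above does:
tags = [tag for sent in tagged for (_, tag) in sent]
print(Counter(tags))  # Counter({'DET': 1, 'NOUN': 1})
```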

EDIT: it's a good idea to use FreqDist when you need to count things such as part-of-speech tags. I don't think it's very clever to have a function return a plain list of results; in principle, how would you know which number represents which tag?

A possible (IMHO bad) solution is to return a sorted list of FreqDist.values(). This way the results are ordered alphabetically by tag name. If you really want this, replace return counts with return [item[1] for item in sorted(counts.items())] in the definition of the function above.
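That replacement can be demonstrated without the corpus using a plain Counter (which FreqDist subclasses); the counts below are hard-coded to mirror the FreqDist from the test above:

```python
from collections import Counter

# Stand-in for the FreqDist({'DET': 14352, 'NOUN': 32029}) computed above.
counts = Counter({'NOUN': 32029, 'DET': 14352})

# Sorting the (tag, count) pairs alphabetically by tag and keeping only
# the counts gives the flat list the doctest expects:
result = [item[1] for item in sorted(counts.items())]
print(result)  # [14352, 32029]
```

Note the order: 'DET' sorts before 'NOUN', so its count comes first, which is why the expected output is [14352, 32029] and not the other way round.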
