Python NLTK Collocations for tagged text

Question

I'm not sure if this is possible, but I thought I would ask just in case. Say you have a dataset of examples of the form "body | tags", for example:

"I went to the store and bought some bread" | shopping food

I am wondering if there is a way to use NLTK Collocations to count the number of times body words and tag words co-occur in the dataset. One example might be something like ("bread", "food", 598), where "bread" is a body word, "food" is a tag word, and 598 is the number of times they co-occur in the dataset.

Answer 1

Without using NLTK, you can do this:

from collections import Counter
from itertools import product

documents = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

# Keep only the body text of each line; the tags after '|' are dropped here.
documents = [line.split('|')[0].strip('" ') for line in documents.split('\n')]

collocations = Counter()

for doc in documents:
    tokens = doc.split()
    # Get all the possible word collocations with product.
    # NOTE: this pairs each token with itself, so we need
    #       to remove the count for the token with itself.
    pairs = Counter(product(tokens, tokens)) \
            - Counter((tok, tok) for tok in tokens)
    collocations += pairs


for pair in collocations:
    print(pair, collocations[pair])

You will run into the problem of how to count collocations of a word that is repeated within a sentence, for instance:

bar bar black sheep is not a real sheep

What is the collocation count for ('bar','bar')? Is it 2 or 1? For this sentence the code above gives 2, because the first bar collocates with the second bar and the second bar collocates with the first.
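
To get counts of the ("bread", "food", 598) kind that the question asks for, the same Counter and product idea can be applied across the two sides of each line instead of within the body. The following is only a minimal sketch, assuming Python 3, whitespace tokenization, and that everything after the first '|' is a space-separated tag list. The names body_tag_counts and once_per_document are hypothetical and not part of NLTK. The once_per_document flag picks one of the two answers to the repeated-word question: count every occurrence, or count each pair at most once per line.

```python
from collections import Counter
from itertools import product

# Same toy data as in the answer, with the tags kept this time.
raw = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''


def body_tag_counts(text, once_per_document=False):
    """Count (body word, tag word) pairs over 'body | tags' lines.

    Hypothetical helper: assumes one example per line and
    whitespace-separated words on both sides of the first '|'.
    """
    counts = Counter()
    for line in text.splitlines():
        body, _, tags = line.partition('|')
        body_words = body.strip('" ').split()
        tag_words = tags.split()
        if once_per_document:
            # Count each distinct pair at most once per line.
            body_words, tag_words = set(body_words), set(tag_words)
        # Every body word is paired with every tag of the same line.
        counts.update(product(body_words, tag_words))
    return counts


per_occurrence = body_tag_counts(raw)
per_document = body_tag_counts(raw, once_per_document=True)

for (word, tag), n in per_occurrence.most_common(5):
    print(word, tag, n)
```

With the toy data, ('bar', 'tag1') is counted 3 times per occurrence (once in the first line, twice in the third) but only 2 times per document, so the choice of convention changes the numbers and should be made deliberately.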
