
Python NLTK Collocations for tagged text

I'm not sure if this is possible, but I thought I would ask just in case. Say you had a dataset of examples of the form "body | tags", for example:

"I went to the store and bought some bread" | shopping food

I am wondering if there is a way to use NLTK Collocations to count the number of times body words and tag words co-occur in the dataset. One example might be something like ("bread", "food", 598), where "bread" is a body word, "food" is a tag word, and 598 is the number of times they co-occur in the dataset.

Without using NLTK, you can count the word pairs that co-occur within each body like this (the tags are dropped here):

from collections import Counter
from itertools import product

raw = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

# Keep only the body of each line: drop the tags and the surrounding quotes.
documents = [line.split('|')[0].strip('" ') for line in raw.split('\n')]

collocations = Counter()

for doc in documents:
    tokens = doc.split()
    # Get all the possible ordered word pairs with product.
    # NOTE: this pairs each token with itself, so we subtract
    #       one (w, w) count per token to remove those self-pairs.
    pairs = Counter(product(tokens, tokens)) - Counter((w, w) for w in tokens)
    collocations += pairs

for pair, count in collocations.items():
    print(pair, count)
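
The snippet above pairs body words with each other and discards the tags. For the body-word/tag counts the question asks about, a minimal sketch along the same lines might look like the following. The sample data, the function name body_tag_counts, the lowercasing, and the choice to count each pair at most once per example are all assumptions made for illustration, not part of the original answer.

```python
from collections import Counter
from itertools import product

# Hypothetical sample data in the "body | tags" form from the question.
raw = '''"I went to the store and bought some bread" | shopping food
"we bought bread and cheese" | food
"the store was closed" | shopping'''


def body_tag_counts(text):
    """Count (body word, tag) pairs, at most once per example."""
    counts = Counter()
    for line in text.splitlines():
        body, _, tags = line.partition('|')
        # Sets mean a repeated word or tag is counted once per example.
        words = set(body.strip('" ').lower().split())
        tag_set = set(tags.split())
        counts.update(product(words, tag_set))
    return counts


counts = body_tag_counts(raw)
print(counts[('bread', 'food')])
```

If you would rather count every occurrence of a word, replace the sets with plain lists; the rest of the sketch stays the same.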

You will run into the problem of how to count collocations of a word that is repeated within a sentence, for instance,

bar bar black sheep is not a real sheep

What is the collocation count for ('bar', 'bar')? Is it 2 or 1? The word-pair code above gives 2, because the first bar collocates with the second bar and the second bar collocates with the first.
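
If you decide the answer should be 1, or that a word should never collocate with itself, one possible approach is to count unordered pairs with itertools.combinations instead of product. This is only a sketch of two alternative conventions; the variable names position_pairs and type_pairs are made up for illustration.

```python
from collections import Counter
from itertools import combinations

tokens = "bar bar black sheep is not a real sheep".split()

# Convention 1: each unordered pair of token positions is counted once,
# so the two occurrences of "bar" contribute a single ('bar', 'bar') pair.
# Sorting each pair makes ('a', 'b') and ('b', 'a') the same key.
position_pairs = Counter(tuple(sorted(p)) for p in combinations(tokens, 2))

# Convention 2: deduplicate the tokens first, so a word never pairs
# with itself and each distinct pair is counted once per sentence.
type_pairs = Counter(combinations(sorted(set(tokens)), 2))

print(position_pairs[('bar', 'bar')])
print(type_pairs[('bar', 'bar')])
```

Which convention is right depends on what you plan to do with the counts, so it is worth deciding before aggregating over the whole dataset.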


 