
ConditionalFreqDist to find most frequent POS tags for words

I am trying to find the most frequent POS tag for each word in the dataset, but I am struggling with the ConditionalFreqDist part.

import nltk
tw = nltk.corpus.brown.tagged_words()

train_idx = int(0.8*len(tw))
training_set = tw[:train_idx]
test_set = tw[train_idx:]

words= list(zip(*training_set))[0]

from nltk import ConditionalFreqDist
ofd= ConditionalFreqDist(word for word in list(zip(*training_set))[0])

tags= list(zip(*training_set))[1]
ofd.tabulate(conditions= words, samples= tags)

ValueError: too many values to unpack (expected 2)

As the NLTK documentation describes it, a ConditionalFreqDist is:

A collection of frequency distributions for a single experiment run under different conditions.
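To make that definition concrete, here is a rough standard-library sketch of the same idea: one counter per condition. It is only a model of the concept, not NLTK's implementation, and the toy pairs and the name `dist` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs; illustrative only, not taken from the Brown corpus.
pairs = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("the", "AT")]

# One Counter per condition (the word), counting its samples (the tags).
dist = defaultdict(Counter)
for word, tag in pairs:
    dist[word][tag] += 1

# Each condition now has its own frequency distribution over tags.
print(dict(dist["run"]))
print(dict(dist["the"]))
```

Here the word plays the role of the condition and the tag plays the role of the sample counted under it.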

All you need to provide is a sequence of (condition, sample) pairs, which in this problem means (word, tag) pairs. The ValueError comes from passing bare words instead of pairs: ConditionalFreqDist tries to unpack each string into a condition and a sample. The code with minimal changes would look like this. It calculates the distributions for the whole training set but tabulates only the first 10 words and tags, since tabulating everything may crash:

import nltk
from nltk import ConditionalFreqDist

tw = nltk.corpus.brown.tagged_words()
train_idx = int(0.8 * len(tw))
training_set = tw[:train_idx]
test_set = tw[train_idx:]
words, tags = zip(*training_set)  # words are the conditions, tags are the samples

# ConditionalFreqDist expects (condition, sample) pairs, here (word, tag)
ofd = ConditionalFreqDist((word, tag) for word, tag in zip(words, tags))
# tabulate only a small slice of the conditions and samples
ofd.tabulate(conditions=words[:10], samples=tags[:10])
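To get at the original goal, the most frequent tag per word, one option is to call `max()` on each word's FreqDist. The sketch below uses a small hand-made list of pairs (hypothetical data, so it runs without downloading the Brown corpus); the names `toy_pairs`, `cfd` and `most_frequent_tag` are illustrative only.

```python
from nltk import ConditionalFreqDist

# Hypothetical toy data standing in for the (word, tag) training set.
toy_pairs = [
    ("book", "NN"), ("book", "VB"), ("book", "NN"),
    ("the", "AT"), ("the", "AT"),
]

# Each pair is (condition, sample), i.e. (word, tag).
cfd = ConditionalFreqDist(toy_pairs)

# FreqDist.max() returns the sample with the highest count for that condition.
most_frequent_tag = {word: cfd[word].max() for word in cfd.conditions()}
print(most_frequent_tag)
```

The same comprehension should work on the `ofd` object built from the training set. If the count is needed as well, `ofd[word].most_common(1)` returns the top (tag, count) pair.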
