
ConditionalFreqDist to find most frequent POS tags for words

I am trying to find the most frequent POS tag for each word in the dataset, but I am struggling with the ConditionalFreqDist part.

import nltk
tw = nltk.corpus.brown.tagged_words()

train_idx = int(0.8*len(tw))
training_set = tw[:train_idx]
test_set = tw[train_idx:]

words= list(zip(*training_set))[0]

from nltk import ConditionalFreqDist
ofd= ConditionalFreqDist(word for word in list(zip(*training_set))[0])

tags= list(zip(*training_set))[1]
ofd.tabulate(conditions= words, samples= tags)

ValueError: too many values to unpack (expected 2)

As you can read in the documentation, ConditionalFreqDist helps you calculate

A collection of frequency distributions for a single experiment run under different conditions.
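Concretely, ConditionalFreqDist accepts an iterable of (condition, sample) pairs. A minimal sketch with a toy tagged list (hypothetical data, not taken from the Brown corpus):

```python
from nltk import ConditionalFreqDist

# Each item is a (condition, sample) pair -- here (word, tag).
pairs = [('the', 'AT'), ('dog', 'NN'), ('the', 'AT'),
         ('runs', 'VBZ'), ('the', 'DT')]

cfd = ConditionalFreqDist(pairs)
print(cfd['the'].most_common())  # -> [('AT', 2), ('DT', 1)]
print(cfd['the'].max())          # -> 'AT'
```

`most_common()` and `max()` come from FreqDist (a `Counter` subclass), so the most frequent tag for a word is simply `cfd[word].max()`.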

The only thing you must provide is the list of items and conditions, which in this problem translates to words and their corresponding POS tags. With minimal changes, the code looks like this: it calculates distributions for the whole corpus but tabulates the results only for the first 10 items and conditions (preventing a crash):

import nltk
from nltk import ConditionalFreqDist

tw = nltk.corpus.brown.tagged_words()
train_idx = int(0.8*len(tw))
training_set = tw[:train_idx]
test_set = tw[train_idx:]
words = list(zip(*training_set))[0]  # conditions (the words)
tags  = list(zip(*training_set))[1]  # samples (the POS tags)

# ConditionalFreqDist expects (condition, sample) pairs, i.e. (word, tag)
ofd = ConditionalFreqDist((word, tag) for word, tag in zip(words, tags))
ofd.tabulate(conditions=words[:10], samples=tags[:10])
