
How to extract the words and tags in Brown corpus NLTK simply?

NLTK provides an interface to the Brown corpus and its POS tags, which can be accessed like this:

>>> from nltk.corpus import brown
>>> brown.tagged_sents()
[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and', u'CC'), (u'thanks', u'NNS'), (u'of', u'IN'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'of', u'IN-TL'), (u'Atlanta', u'NP-TL'), (u"''", u"''"), (u'for', u'IN'), (u'the', u'AT'), (u'manner', u'NN'), (u'in', u'IN'), (u'which', u'WDT'), (u'the', u'AT'), (u'election', u'NN'), (u'was', u'BEDZ'), (u'conducted', u'VBN'), (u'.', u'.')], ...]

brown.tagged_sents() is a list of sentences, where each sentence is a list of tuples: the first element of each tuple is the word and the second is its POS tag.
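To make the structure concrete, here is a minimal sketch using the first few pairs hand-copied from the output above (no corpus download needed):

```python
# One Brown sentence is just a list of (word, tag) tuples
tagged_sent = [(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL')]

# Indexing gives a single (word, tag) pair
word, tag = tagged_sent[0]

# zip(*pairs) transposes the pairs into one tuple of words and one of tags
words, tags = zip(*tagged_sent)
```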

The goal is to process the Brown corpus into a file like the one below, where each line is a tab-delimited sentence: the first column contains the words of the sentence separated by whitespace, and the second column contains the corresponding tags separated by whitespace:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .  AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP .

I have tried this:

from nltk.corpus import brown

tagged_sents = brown.tagged_sents()
# zip(*tagged_sent) transposes each [(word, tag), ...] sentence into
# a tuple of words and a tuple of tags
with open('brown.txt', 'w') as fout:
    fout.write('\n'.join(' '.join(sent) + '\t' + ' '.join(tags)
                         for sent, tags in
                         (zip(*tagged_sent) for tagged_sent in tagged_sents)))

And it works, but there must be a better way to munge the corpus.
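One way to keep the munging readable is to pull the line-building out into a helper and test it on plain lists, independent of NLTK. This is a sketch, and `format_tagged` is a name introduced here for illustration, not an NLTK API:

```python
def format_tagged(tagged_sents):
    """Render tagged sentences as 'words<TAB>tags' lines.

    tagged_sents: an iterable of sentences, each a list of (word, tag) tuples,
    i.e. the same shape as brown.tagged_sents().
    """
    lines = []
    for tagged_sent in tagged_sents:
        # transpose the (word, tag) pairs into a word column and a tag column
        words, tags = zip(*tagged_sent)
        lines.append(' '.join(words) + '\t' + ' '.join(tags))
    return '\n'.join(lines)
```

With this helper, writing the file reduces to one `fout.write(format_tagged(brown.tagged_sents()))` call, and the formatting logic can be checked on a tiny hand-made sample first.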

data = [[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR')]]

# walks the nested data and collects every word into one string
def data_printer(data):
    string = ''
    for sent in data:               # each sentence in the corpus sample
        for word, tag in sent:      # each (word, tag) pair in the sentence
            string += ' ' + word
    print(string)
    return string

data_printer(data)

There is a better way to do it by working with the ordered (word, tag) pairs directly. This is a minimalistic way with no imports.
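A sketch of that "ordered pairs, no imports" idea, building both columns in a single pass over the pairs (`format_line` is a hypothetical name introduced here):

```python
def format_line(tagged_sent):
    """Turn one [(word, tag), ...] sentence into a 'words<TAB>tags' line,
    using only plain loops and str.join -- no imports, no zip."""
    words = []
    tags = []
    for word, tag in tagged_sent:   # each ordered (word, tag) pair
        words.append(word)
        tags.append(tag)
    return ' '.join(words) + '\t' + ' '.join(tags)
```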

