
How to extract the words and tags in Brown corpus NLTK simply?

NLTK provides an interface to the Brown corpus and its POS tags, which can be accessed like this:

>>> from nltk.corpus import brown
>>> brown.tagged_sents()
[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and', u'CC'), (u'thanks', u'NNS'), (u'of', u'IN'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'of', u'IN-TL'), (u'Atlanta', u'NP-TL'), (u"''", u"''"), (u'for', u'IN'), (u'the', u'AT'), (u'manner', u'NN'), (u'in', u'IN'), (u'which', u'WDT'), (u'the', u'AT'), (u'election', u'NN'), (u'was', u'BEDZ'), (u'conducted', u'VBN'), (u'.', u'.')], ...]

brown.tagged_sents() is a list of sentences, where each sentence is a list of tuples: the first element of each tuple is the word and the second is its POS tag.
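To make the structure concrete, here is a minimal sketch using the first few pairs hand-copied from the output above (no corpus download needed):

```python
# One Brown sentence is just a list of (word, tag) tuples
tagged_sent = [(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL')]

# Indexing gives a single (word, tag) pair
word, tag = tagged_sent[0]

# zip(*pairs) transposes the pairs into one tuple of words and one of tags
words, tags = zip(*tagged_sent)
```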

The goal is to process the Brown corpus into a file like the one below, where each line is a tab-delimited sentence: the first column contains the words of the sentence separated by whitespace, and the second column contains the corresponding tags separated by whitespace:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .  AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP .

I have tried this:

from nltk.corpus import brown

tagged_sents = brown.tagged_sents()
# zip(*tagged_sent) transposes each [(word, tag), ...] sentence into
# a tuple of words and a tuple of tags
with open('brown.txt', 'w') as fout:
    fout.write('\n'.join(' '.join(sent) + '\t' + ' '.join(tags)
                         for sent, tags in
                         (zip(*tagged_sent) for tagged_sent in tagged_sents)))

And it works, but there must be a better way to munge the corpus.
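One way to keep the munging readable is to pull the line-building out into a helper and test it on plain lists, independent of NLTK. This is a sketch, and `format_tagged` is a name introduced here for illustration, not an NLTK API:

```python
def format_tagged(tagged_sents):
    """Render tagged sentences as 'words<TAB>tags' lines.

    tagged_sents: an iterable of sentences, each a list of (word, tag) tuples,
    i.e. the same shape as brown.tagged_sents().
    """
    lines = []
    for tagged_sent in tagged_sents:
        # transpose the (word, tag) pairs into a word column and a tag column
        words, tags = zip(*tagged_sent)
        lines.append(' '.join(words) + '\t' + ' '.join(tags))
    return '\n'.join(lines)
```

With this helper, writing the file reduces to one `fout.write(format_tagged(brown.tagged_sents()))` call, and the formatting logic can be checked on a tiny hand-made sample first.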

data = [[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR')]]

# walks the nested data and collects every word into one string
def data_printer(data):
    string = ''
    for sent in data:               # each sentence in the corpus sample
        for word, tag in sent:      # each (word, tag) pair in the sentence
            string += ' ' + word
    print(string)
    return string

data_printer(data)

There is a better way to do it by working with the ordered (word, tag) pairs directly. This is a minimalistic way with no imports.
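A sketch of that "ordered pairs, no imports" idea, building both columns in a single pass over the pairs (`format_line` is a hypothetical name introduced here):

```python
def format_line(tagged_sent):
    """Turn one [(word, tag), ...] sentence into a 'words<TAB>tags' line,
    using only plain loops and str.join -- no imports, no zip."""
    words = []
    tags = []
    for word, tag in tagged_sent:   # each ordered (word, tag) pair
        words.append(word)
        tags.append(tag)
    return ' '.join(words) + '\t' + ' '.join(tags)
```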

