How do I extract words from the sample corpora that ship with NLTK?

Question

NLTK comes with some sample corpora, listed at: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

I want only the plain text, without the markup. I do not know how to extract such content. What I want to extract is:

1) nps_chat: after unzipping, the filenames look like 10-19-20s_706posts.xml. Each file is in XML format, like this:

<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>

                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>

        </Post>
            ...
            ...

I only want the actual post:

now im left with this gay name

How can I do this in NLTK (or with anything else), so that the bare posts are saved to local disk after the markup has been stripped?

2) switchboard transcript. This type of file (the filename is discourse after unzipping) contains lines in the following format. What I want is to strip the preceding markers:

o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
...

I want to have only:

Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
...

Thank you very much in advance.

Answer 1

First, you need to make a corpus reader for the corpus. There are several corpus readers available in nltk.corpus, such as:

AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader

Once you've made a corpus reader out of your corpus like so:

import nltk
c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)

you can get the words out of the corpus by using the following code:

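# Flatten paragraphs -> sentences -> words into a single list.
# (The original loop overwrote `words` on every paragraph, keeping only the last one.)
words = [word for para in c.paras() for sentence in para for word in sentence]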

This should get you a list of all the words in all the paragraphs of your corpus.
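For the Switchboard part of the question, where only the leading markers need to go, a plain regular expression may be enough, without any corpus reader. The following is a minimal sketch, not an NLTK feature: it assumes every line has the shape shown in the question (a dialogue-act tag, a speaker and turn such as A.1, then uttN:), and the names MARKER and strip_markers are made up for this illustration.

```python
import re

# Assumed line shape, taken from the sample in the question:
#   <tag> <speaker>.<turn> utt<N>: <utterance text>
# e.g. "sd^e B.2 utt2: I've never been served on the jury, ..."
MARKER = re.compile(r"^\S+\s+[AB]\.\d+\s+utt\d+:\s*")


def strip_markers(lines):
    """Return the utterance text of every line that starts with a marker.

    Lines that do not match the assumed shape are skipped.
    """
    stripped = []
    for line in lines:
        line = line.rstrip("\n")
        if MARKER.match(line):
            stripped.append(MARKER.sub("", line, count=1))
    return stripped


sample = [
    "o A.1 utt1: Okay, /",
    "qy A.1 utt2: have you ever served as a juror? /",
    "ng B.2 utt1: Never. /",
    "% B.4 utt1: You haven't, {F huh. } /",
]

for text in strip_markers(sample):
    print(text)
```

To process the real file, the same function can be fed the lines of the open discourse file, and the result written back out with one utterance per line.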

Hope this helps.

Answer 2

You can use the .words() method of the NLTK corpus reader:

from nltk.corpus import nps_chat
content = nps_chat.words()

This will give you all the words in a list:

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]
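Since the question asks for the bare post text rather than individual words, here is a hedged sketch that reads it straight from the XML with Python's standard library. It assumes the layout shown in the question, where the post text sits directly inside each Post element, before the terminals child. The helper name bare_posts and the shortened sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the layout shown in the question (an assumption about
# the real files): the post text precedes the <terminals> child element.
sample_xml = """<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
    <t pos="RB" word="now"/>
    <t pos="PRP" word="im"/>
</terminals>
</Post>
</Posts>"""


def bare_posts(xml_text):
    """Return the text of every <Post> element, without the tagged terminals."""
    root = ET.fromstring(xml_text)
    # Element.text holds only the text that comes before the first child
    # element, which here is exactly the raw post.
    return [(post.text or "").strip() for post in root.iter("Post")]


posts = bare_posts(sample_xml)
print(posts)
```

For a file on disk, the same idea applies with ET.parse(path).getroot(), and the resulting list can be written to a text file one post per line. If the corpus is installed through NLTK, nps_chat.xml_posts() should give access to the same Post elements, so reading post.text on each of them is likely to work there as well.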

Answer 1
2, accepted, 2011-01-22 06:40:17

Answer 2
1, 2017-02-10 03:08:39
