简体   繁体   中英

How to extract words from sample corpus that comes with NLTK?

NLTK comes with some samples of corpus at: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

I want to have only text without encodings. I do not know how to extract such content. What I want to extract are

1) nps_chat: filename is like 10-19-20s_706posts.xml after unzip. Such file is XML format like:

<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>

                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>

        </Post>
            ...
            ...

I only want that actual post:

now im left with this gay name

How can do in NLTK or (whatever) to save the bare posts after stripping encodings in local disk?

2) switchboard transcript. This type of file (filename is discourse after unzip) contains the following formats. What I want is to strip preceding markers:

o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
... 

I want to have only:

Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
... 

Thank you very much in advance.

First, you need to make a corpus reader for the corpus. There are some corpus readers that you can use in nltk.corpus , such as:

AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader

Once you've made a corpus reader out of your corpus like so:

c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)

you can get the words out of the corpus by using the following code:

paragraphs = [para for para in c.paras()]
for para in paragraphs:
    words = [word for sentence in para for word in sentence]

This should get you a list of all the words in all the paragraphs of your corpus.

Hope this helps

You can use .words() property from nltk corpus

content = nps_chat.words()

This will give you all the words in a list

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM