
How to extract words from sample corpus that comes with NLTK?

NLTK comes with some sample corpora, listed at: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

I only want the text, without the markup. I don't know how to extract that content. What I want to extract is:

1) nps_chat: the unzipped files have names like 10-19-20s_706posts.xml. These files are in XML format, for example:

<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>

                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>

        </Post>
            ...
            ...

I want only the actual post:

now im left with this gay name

How can I strip the markup and save the bare posts to local disk, either with NLTK or by any other means?

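2) Switchboard transcripts. This type of file (the file name after unzipping is discourse) has the following format. What I want is to remove the tags at the start of each line: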

o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
... 

I want to have only:

Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
... 

Thank you very much in advance.

First, you need to create a corpus reader for your corpus. Several corpus readers are available in nltk.corpus, for example:

AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader

After you have created a corpus object with the corpus reader you chose, like so:

c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)

you can get the words out of the corpus with the following code:

# Flatten paragraph -> sentence -> word into one list.
# (Assigning to `words` inside a loop would keep only the last paragraph.)
words = [word for para in c.paras() for sentence in para for word in sentence]

This should give you a list of all the words in all the paragraphs of your corpus.

Hope this helps.
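For the Switchboard discourse lines in part 2 of the question, a corpus reader may not be needed just to drop the leading tags. The following is a minimal sketch using only the standard library. It assumes every utterance line has the shape `<tag> <speaker>.<turn> utt<n>: <text>`, as in the sample lines quoted in the question; the names `UTT_PREFIX` and `strip_prefix` are made up for illustration, and real files may contain header lines or variants this pattern does not cover.

```python
import re

# Assumed line shape (taken from the sample in the question):
#   <dialogue-act tag> <speaker A|B>.<turn number> utt<number>: <utterance text>
UTT_PREFIX = re.compile(r"^\S+\s+[AB]\.\d+\s+utt\d+:\s*")


def strip_prefix(line):
    """Remove the leading tag, speaker and utterance id from one line.

    Lines that do not match the assumed shape are returned unchanged
    (apart from the trailing newline being removed).
    """
    return UTT_PREFIX.sub("", line.rstrip("\n"))


# Sample lines copied from the question.
sample_lines = [
    "o A.1 utt1: Okay, /",
    "qy A.1 utt2: have you ever served as a juror? /",
    "ng B.2 utt1: Never. /",
    "% B.4 utt1: You haven't, {F huh. } /",
]

stripped = [strip_prefix(line) for line in sample_lines]
for text in stripped:
    print(text)
```

To process a whole file, the same function could be applied to each line read from the discourse file and the results written to a new text file.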

You can use the .words() method of an NLTK corpus:

from nltk.corpus import nps_chat  # requires the nps_chat data to be downloaded

content = nps_chat.words()

This gives you all the words in a list:

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]
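Since .words() returns individual tokens rather than the original post strings, the bare posts asked about in part 1 can also be read directly from the XML. Below is a minimal sketch using the standard library's xml.etree.ElementTree instead of NLTK. It assumes each post's raw text is the text of the `<Post>` element that comes before its `<terminals>` child, as in the sample in the question; `extract_posts` and `save_posts` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the fragment quoted in the question.
SAMPLE_XML = """<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
    <t pos="RB" word="now"/>
    <t pos="PRP" word="im"/>
    <t pos="VBD" word="left"/>
    <t pos="IN" word="with"/>
    <t pos="DT" word="this"/>
    <t pos="JJ" word="gay"/>
    <t pos="NN" word="name"/>
</terminals>
</Post>
</Posts>"""


def extract_posts(xml_text):
    """Return the raw text of every <Post> element in the document."""
    root = ET.fromstring(xml_text)
    # post.text is the text between the opening <Post> tag and its first child.
    return [(post.text or "").strip() for post in root.iter("Post")]


def save_posts(posts, out_path):
    """Write one post per line to a plain-text file."""
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(posts) + "\n")


posts = extract_posts(SAMPLE_XML)
print(posts)
```

For a file on disk such as 10-19-20s_706posts.xml, the file's contents could be read and passed to `extract_posts`, and the result handed to `save_posts` with an output path of your choice.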

