How to read nltk.text.Text files from nltk.book in Python?

I'm learning a lot about Natural Language Processing with nltk and can do a lot of things, but I can't find a way to read the texts from the package. I have tried things like this:

from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()

But it doesn't seem to work, because it has no fileid. No one seems to have asked this question before, so I assume the answer should be easy. Do you know how to read those texts, or how to convert them into a string? Thanks in advance.

Let's dig into the code =)

Firstly, the nltk.book code resides at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look carefully, the texts are loaded as nltk.Text objects, e.g. for text6, from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36 :

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")

The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286 , and you can read more about how to use it at http://www.nltk.org/book/ch02.html

webtext is a corpus from nltk.corpus, so to get to the raw text of nltk.book.text6 you can load webtext directly, e.g.

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')

The fileids are available only on a PlaintextCorpusReader object such as webtext, not on the Text object (which is a processed object):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt

Looks like they already break it up into tokens for you.

from nltk.book import text6

text6.tokens
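
Since the question also asks how to turn a text into a string, here is a minimal sketch of joining a token list back together. The helper name `tokens_to_string` and the sample tokens are made up for illustration (they are not part of NLTK), and NLTK itself is not imported so the snippet runs on its own. Note that joining on spaces only approximates the original text: spacing around punctuation is not restored, so `webtext.raw('grail.txt')` is still the better choice when the exact raw text matters.

```python
def tokens_to_string(tokens, sep=" "):
    """Join an iterable of string tokens into one string.

    Hypothetical helper for illustration; it works on any iterable of
    strings, such as the list stored in the `tokens` attribute of a Text.
    """
    return sep.join(tokens)


# Made-up sample tokens standing in for a tokenized text.
sample_tokens = ["Hello", ",", "this", "is", "a", "tokenized", "text", "."]
print(tokens_to_string(sample_tokens))
```

With NLTK installed and the book data downloaded, the same call should work as `tokens_to_string(text6.tokens)`.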
