How to read nltk.text.Text files from nltk.book in Python?
I'm learning a lot about Natural Language Processing with nltk and can already do a lot of things, but I can't find a way to read the texts that ship with the package. I have tried things like this:
from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()
But it doesn't seem to work, because it has no fileid. No one seems to have asked this question before, so I assume the answer should be easy. Do you know how to read those texts, or how to convert them into a string? Thanks in advance.
Let's dig into the code =)
Firstly, the nltk.book code resides at https://github.com/nltk/nltk/blob/develop/nltk/book.py
If we look carefully, the texts are loaded as nltk.Text objects, e.g. for text6, from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36:
text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")
The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286, and you can read more about how to use it at http://www.nltk.org/book/ch02.html
webtext is a corpus from nltk.corpus, so to get to the raw text behind nltk.book.text6 you could load webtext directly, e.g.:
>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')
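If you want the same call in a script rather than at the prompt, here is a minimal sketch of a more defensive variant. The helper name load_webtext_raw is my own and not part of nltk; it assumes the corpus data may not be installed and returns None in that case instead of raising.

```python
def load_webtext_raw(fileid="grail.txt"):
    """Return the raw text of a webtext file as a str, or None if unavailable.

    Hypothetical helper for illustration; only webtext.raw() is nltk's own API.
    """
    try:
        from nltk.corpus import webtext  # lazily loaded corpus reader
        return webtext.raw(fileid)       # the whole file as one string
    except ImportError:
        return None  # nltk itself is not installed
    except LookupError:
        return None  # corpus data missing; nltk.download('webtext') fetches it


raw = load_webtext_raw()
if raw is None:
    print("webtext corpus not available")
else:
    print(type(raw), len(raw))
    print(raw[:75])  # first few characters of the script
```

The string returned here is the original file content, so spacing and line breaks are preserved, unlike anything rebuilt from tokens.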
The fileids are only available when you load a PlaintextCorpusReader object, not from the Text object (which is a processed object):
>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
... print(filename)
...
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
Looks like they already break it up into tokens for you.
from nltk.book import text6
text6.tokens
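Since the tokens are just a list of strings, one way to get a single string back is to join them. A rough sketch follows; tokens_to_string is a made-up helper name, and sample_tokens is a hand-written stand-in so the snippet runs without nltk installed. With the real data you would pass text6.tokens instead.

```python
def tokens_to_string(tokens):
    """Join a list of token strings into one space-separated string."""
    return " ".join(tokens)


# Illustrative stand-in for text6.tokens (not read from the corpus).
sample_tokens = ["SCENE", "1", ":", "[", "wind", "]",
                 "KING", "ARTHUR", ":", "Whoa", "there", "!"]

print(tokens_to_string(sample_tokens))
# prints: SCENE 1 : [ wind ] KING ARTHUR : Whoa there !
```

Note that joining puts a space around every punctuation token, so the result is not identical to the original file. If you need the exact original text, reading the raw corpus file is the safer route.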