How to read nltk.text.Text files from nltk.book in Python?

Question

I'm learning a lot about natural language processing with nltk and can already do many things, but I can't find a way to read the texts from the package. I have tried things like this:

from nltk.book import *
text6  # shows the title of the text
open(text6).read()
# or
nltk.book.text6.read()

But it doesn't seem to work, because it has no fileid. No one seems to have asked this question before, so I assume the answer should be easy. Do you know how to read those texts, or how to convert them into a string? Thanks in advance.

Answer 1

Let's dig into the code =)

First, the nltk.book code resides at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look carefully, the texts are loaded as nltk.Text objects, e.g. for text6, from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36 :

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")

The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286 ; you can read more about how to use it at http://www.nltk.org/book/ch02.html

webtext is a corpus from nltk.corpus, so to get the raw text behind nltk.book.text6 you can load webtext directly, e.g.

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')

The fileids are available only on a corpus reader such as the PlaintextCorpusReader object, not on the Text object (a processed object):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
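Putting the two pieces above together, here is a minimal sketch of getting the text as one string. The helper names read_raw and grail_as_string are made up for illustration and are not part of NLTK; the only NLTK calls used are the reader's own fileids() and raw(). It assumes the webtext corpus has already been downloaded.

```python
def read_raw(reader, fileid):
    """Return the whole file `fileid` from an NLTK corpus reader as one string.

    Hypothetical helper (not part of NLTK): it only calls the reader's
    fileids() and raw() methods, and fails early on an unknown fileid.
    """
    available = reader.fileids()
    if fileid not in available:
        raise ValueError(f"unknown fileid {fileid!r}; available: {available}")
    return reader.raw(fileid)


def grail_as_string():
    """Raw text behind nltk.book.text6 (assumes the webtext corpus is installed)."""
    # Imported lazily so that read_raw stays usable without the corpus data.
    from nltk.corpus import webtext
    return read_raw(webtext, "grail.txt")
```

Calling grail_as_string() should give the same string as webtext.raw('grail.txt'); the check on fileids() just turns a typo in the file name into a clearer error.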

Answer 2

Looks like they already break it up into tokens for you.

from nltk.book import text6

text6.tokens
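If you need a single string rather than a list, one rough option is to join the tokens yourself. This is only a sketch: text_to_string is a made-up helper name, and joining with spaces will not reproduce the original spacing (punctuation ends up as separate, space-delimited tokens).

```python
def text_to_string(text, sep=" "):
    """Join the tokens of an nltk.text.Text-like object into one string.

    Hypothetical helper: it relies only on the `.tokens` attribute, so the
    result is an approximation of the original text, not the raw file.
    """
    return sep.join(text.tokens)


# Usage with the real corpus (assumes the data used by nltk.book is installed):
#     from nltk.book import text6
#     print(text_to_string(text6)[:200])
```

For the exact original text, reading the raw file through the corpus reader is likely the better route.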

2 answers

Answer 1: 3, accepted, 2018-03-15 09:48:59

Answer 2: 1, 2018-03-14 18:19:49
