
How do I use the book functions (e.g. concordance) in NLTK?

I am reading this wonderful tutorial.

I downloaded a collection called book:

>>> import nltk
>>> nltk.download()

and imported the texts:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811

Then I can run commands on these texts:

>>> text1.concordance("monstrous")

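How can I run these NLTK commands on my own dataset? Are these collections the same as the object book in Python?

You're right, it is hard to find documentation for the book.py module, so we have to get our hands dirty and look at its source code. Judging from book.py, doing the concordance and all the other fancy stuff from the book module takes two steps:

First, put your raw texts into NLTK's corpus class; see "Creating a new corpus with NLTK" for details.

Second, read the corpus words into NLTK's Text class. Then you can use the functions shown in http://nltk.org/book/ch01.html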

import os

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

# For example, create two small example texts
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''
# Create the corpus directory and write each text to its own file
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)

# Read the example corpus into NLTK's corpus class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# Read the NLTK corpus into NLTK's Text class,
# where the book-like concordance search is available
mytext = Text(mycorpus.words())

mytext.concordance('foo')

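Note: you can use other NLTK CorpusReaders, and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we'll stick with the defaults.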
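As a rough illustration of the note above, here is a minimal sketch of passing a custom word tokenizer and an explicit encoding to PlaintextCorpusReader. The temporary directory, the sample sentence, and the choice of RegexpTokenizer(r'\w+') (which drops punctuation) are assumptions made for this example, not part of the original answer.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
from nltk.tokenize import RegexpTokenizer

# Hypothetical throwaway corpus directory, used only for this sketch
corpusdir = tempfile.mkdtemp()
with open(os.path.join(corpusdir, 'text1.txt'), 'w', encoding='utf-8') as fout:
    fout.write('Foo likes to go to the bar. His last name is also bar.')

# Match only .txt files, keep runs of word characters as tokens
# (so punctuation is discarded), and read the files as UTF-8
reader = PlaintextCorpusReader(
    corpusdir,
    r'.*\.txt',
    word_tokenizer=RegexpTokenizer(r'\w+'),
    encoding='utf-8',
)

mytext = Text(reader.words())
mytext.concordance('bar')
```

Whether dropping punctuation is desirable depends on the analysis; the default word tokenizer keeps punctuation as separate tokens.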

From the Text Analysis with NLTK Cheatsheet on blogs.princeton.edu: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf

Using your own texts:

Open a file for reading:

file = open('myfile.txt') 

Make sure you are in the correct directory before starting Python, or give the full path to the file.

Read the file:

t = file.read() 

Tokenize the text:

tokens = nltk.word_tokenize(t)

Convert to an NLTK Text object:

text = nltk.Text(tokens)
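The steps above can be put together as one script. This is only a sketch under stated assumptions: the sample file and its contents are made up for illustration, and wordpunct_tokenize is swapped in for nltk.word_tokenize so the example runs without downloading any tokenizer models.

```python
import os
import tempfile

from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample file standing in for 'myfile.txt'
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('One day, foo went to the bar. The bar was closed.')

# Open and read the file; the with-block closes it afterwards
with open(path, encoding='utf-8') as f:
    raw = f.read()

# Tokenize with a regex-based tokenizer (a swap for word_tokenize)
tokens = wordpunct_tokenize(raw)

# Wrap the tokens in a Text object to get the book-style methods
text = Text(tokens)
text.concordance('bar')
print(text.count('bar'))
```

The resulting object is of the same Text class used in the answer above for mytext, so methods such as concordance and count are available on it.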
