
Python: problems with sentence segmenter, word tokenizer, and part-of-speech tagger

I am trying to read a text file into Python and then run a sentence segmenter, a word tokenizer, and a part-of-speech tagger on it.

Here is my code:

file=open('C:/temp/1.txt','r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

When I try the second command, it shows this error:

Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
sentences = nltk.sent_tokenize(file)
File "D:\Python\lib\site-packages\nltk\tokenize\__init__.py", line 76, in sent_tokenize
return tokenizer.tokenize(text)
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1217, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1262, in sentences_from_text
sents = [text[sl] for sl in self._slices_from_text(text)]
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1269, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Another attempt: when I try just a single sentence, such as "A yellow dog barked at the cat", the first three commands work, but the last line gives the error below. (I wonder whether I failed to download the packages completely?)

Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
sentences = [nltk.pos_tag(sent) for sent in sentences]
File "D:\Python\lib\site-packages\nltk\tag\__init__.py", line 99, in pos_tag
tagger = load(_POS_TAGGER)
File "D:\Python\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named numpy.core.multiarray

Hmm... are you sure the error is on the second line?

You seem to be using single-quote characters other than the standard ASCII ' character:

file=open(‘C:/temp/1.txt’,‘r’) # your version (WRONG)
file=open('C:/temp/1.txt', 'r') # right
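This failure can be reproduced without NLTK at all: Python's parser rejects the curly quotes before anything runs. A minimal sketch (the `bad_line` string here is an illustration, not taken from the question):

```python
# The curly quotes ‘ ’ (U+2018/U+2019) are not valid Python string
# delimiters, so compiling the line fails with a SyntaxError.
bad_line = "file = open(‘C:/temp/1.txt’, ‘r’)"
try:
    compile(bad_line, "<example>", "exec")
except SyntaxError as exc:
    print("SyntaxError:", exc.msg)
```

Such quotes usually sneak in when code is pasted through a word processor or a blog that applies "smart quotes".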

Python cannot even compile that. Indeed, when I tried it, it failed with a syntax error.

Update: you posted a corrected version with valid syntax. The error message in the traceback is quite straightforward: the function you are calling apparently expects a chunk of text as its argument, not a file object. Although I know nothing about NLTK, five seconds on Google confirmed this.

Try something like this:

import nltk

file = open('C:/temp/1.txt', 'r')
text = file.read()  # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)
result2 = [nltk.word_tokenize(sent) for sent in result1]
result3 = [nltk.pos_tag(sent) for sent in result2]

Update: I renamed sentences to result1/2/3, because repeatedly overwriting the same variable made it confusing what the code actually does. This doesn't affect the semantics; it just makes clear that the second line really does contribute to the final result, result3.
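The root cause can also be demonstrated with the standard library alone. The traceback bottoms out in `re`'s `finditer`, which rejects anything that is not a string; a small sketch using `io.StringIO` as a stand-in for the open file in the question:

```python
import io
import re

# io.StringIO behaves like an open text file object, standing in for
# open('C:/temp/1.txt', 'r') from the question.
fake_file = io.StringIO("A yellow dog barked. The cat ran away.")

# Passing the file object itself mirrors the bottom of the traceback,
# where punkt calls period_context_re().finditer(text):
try:
    re.compile(r"\.").finditer(fake_file)
except TypeError as exc:
    print("TypeError:", exc)

# Reading the contents first hands over the str the tokenizer expects:
text = fake_file.getvalue()
print(type(text).__name__)  # str
```

(In Python 2 the message was "expected string or buffer", as in the question; Python 3 says "expected string or bytes-like object".)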

First open the file, then read it:

filename = 'C:/temp/1.txt'
infile = open(filename, 'r')
text = infile.read()

然后像這樣在nltk中鏈接工具:

from nltk import pos_tag, sent_tokenize, word_tokenize

tagged_words = [pos_tag(word_tokenize(i)) for i in sent_tokenize(text)]
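Neither answer addresses the second traceback. Assuming that ImportError simply means NumPy is not installed in the interpreter NLTK runs under (the pickled POS tagger model needs it to unpickle), installing it is a likely fix; a sketch:

```shell
# Assumption: the ImportError means NumPy is absent from this Python
# environment; the pickled tagger model cannot load without it.
pip install numpy

# Verify that NumPy is now importable:
python -c "import numpy; print('numpy OK')"
```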

