Python: problems with sentence segmenter, word tokenizer, and part-of-speech tagger
I am trying to read a text file into Python and then run a sentence segmenter, word tokenizer, and part-of-speech tagger on it.
Here is my code:
file=open('C:/temp/1.txt','r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
When I try the second command, it shows this error:
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
sentences = nltk.sent_tokenize(file)
File "D:\Python\lib\site-packages\nltk\tokenize\__init__.py", line 76, in sent_tokenize
return tokenizer.tokenize(text)
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1217, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1262, in sentences_from_text
sents = [text[sl] for sl in self._slices_from_text(text)]
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1269, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
Another attempt: when I try just a single sentence, e.g. "A yellow dog barked at the cat", the first three commands work, but the last line gives this error (I wonder whether I failed to download the packages completely?):
Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
sentences = [nltk.pos_tag(sent) for sent in sentences]
File "D:\Python\lib\site-packages\nltk\tag\__init__.py", line 99, in pos_tag
tagger = load(_POS_TAGGER)
File "D:\Python\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named numpy.core.multiarray
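As a side note (my addition, not from the original post): the final ImportError usually means NumPy is missing or broken in the Python installation that NLTK is using, since nltk.pos_tag unpickles a tagger model that depends on NumPy. A minimal sketch for checking this, assuming a missing NumPy is indeed the cause:

```python
# Hedged diagnostic: "No module named numpy.core.multiarray" typically means
# NumPy itself cannot be imported. importlib.util.find_spec reports whether a
# package is installed without actually importing it.
import importlib.util

if importlib.util.find_spec("numpy") is None:
    print("NumPy is not installed; try: pip install numpy")
else:
    print("NumPy is installed; try reinstalling it if the error persists")
```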
Hmm... are you sure the error is on the second line?
You seem to be using typographic single-quote characters (‘ and ’) rather than the standard ASCII ' character:
file=open(‘C:/temp/1.txt’,‘r’) # your version (WRONG)
file=open('C:/temp/1.txt', 'r') # right
Python cannot even compile the first version. Indeed, when I tried it, it failed with a syntax error.
Update: you posted a corrected version with proper syntax. The error message in the traceback is quite straightforward: the function you are calling apparently expects a chunk of text as its argument, not a file object. Although I know nothing about NLTK, five seconds on Google confirmed this.
Try something like this:
file = open('C:/temp/1.txt','r')
text = file.read() # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)
result2 = [nltk.word_tokenize(sent) for sent in result1]
result3 = [nltk.pos_tag(sent) for sent in result2]
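To visualize the nesting that each step produces, here is a toy stand-in pipeline using plain string methods instead of NLTK (the naive splitting and the dummy "NN" tags are placeholders, purely for illustration; NLTK does all of this far more robustly):

```python
# Toy stand-in for the NLTK pipeline, showing the data shape at each step.
text = "A yellow dog barked. The cat ran."

# sent_tokenize: str -> list of sentence strings
result1 = [s.strip() + "." for s in text.split(".") if s.strip()]

# word_tokenize per sentence: list[str] -> list of token lists
result2 = [sent.replace(".", " .").split() for sent in result1]

# pos_tag per sentence: list[list[str]] -> list of (token, tag) tuple lists
# (dummy "NN" tags here; NLTK would assign real parts of speech)
result3 = [[(w, "NN") for w in sent] for sent in result2]

print(result1)     # ['A yellow dog barked.', 'The cat ran.']
print(result2[0])  # ['A', 'yellow', 'dog', 'barked', '.']
print(result3[0])  # [('A', 'NN'), ('yellow', 'NN'), ...]
```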
Update: I renamed sentences to result1/2/3, because repeatedly overwriting the same variable made it hard to follow what the code actually does. This does not change the semantics; it just makes clear that each line contributes to the final result, result3.
First open the file, then read it:
filename = 'C:/temp/1.txt'
infile = open(filename, 'r')
text = infile.read()
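As an aside (my addition, not part of the original answer): a with-statement closes the file automatically and is the more idiomatic way to read it. A sketch using a temporary file instead of the question's hard-coded Windows path:

```python
import os
import tempfile

# Create a small sample file so the snippet is self-contained; in the
# original question this path would be 'C:/temp/1.txt'.
path = os.path.join(tempfile.gettempdir(), "sample.txt")
with open(path, "w") as f:
    f.write("A yellow dog barked at the cat.")

# Idiomatic reading: the with-statement closes the file automatically,
# even if an exception is raised inside the block.
with open(path) as infile:
    text = infile.read()

print(text)
```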
然后像這樣在nltk中鏈接工具:
from nltk import sent_tokenize, word_tokenize, pos_tag  # needed for the bare names below
tagged_words = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]