Python NLTK - Tokenize paragraphs into sentences and words
I have some paragraph text in an a.txt file. I'm trying to tokenize the paragraphs into a list of sentences, each split into words, and append the results to a list. I'm not sure what I'm doing wrong, because I manage to get the sentences but not the words. I've been banging my head against the wall over this!
Input:
This is sentence one,
Another sentence:
Third line.
Desired output:
[
['This', 'is', 'sentence', 'one', ','],
['Another', 'sentence', ':'],
['Third', 'line', '.']
]
My faulty code and its output:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

sentences_split_into_words = []
for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.append(words_token)
----Result----
[
[['This', 'is', 'sentence', 'one', ',']],
[['Another', 'sentence', ':']],
[['Third', 'line', '.']]
]
I also tried the following, but it raises the error "expected string or bytes-like object":
for line in sentences:
    sentences_split_into_words.append(word_tokenize(line))
Try this code:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

sentences_split_into_words = []
for line in sentences:
    # word_tokenize each sentence, then extend (not append) to avoid
    # adding an extra level of list nesting
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.extend(words_token)
Reference: https://www.programiz.com/python-programming/methods/list/extend
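The fix works because append adds the whole token list as a single element (producing one extra level of nesting), while extend adds each inner list individually. A minimal sketch with a hypothetical token list (standing in for word_tokenize output, so no NLTK is needed) shows the difference:

    # Hypothetical tokens, standing in for what word_tokenize would return
    tokens = [['This', 'is', 'sentence', 'one', ',']]

    nested = []
    nested.append(tokens)   # one level too deep: [[['This', 'is', ...]]]

    flat = []
    flat.extend(tokens)     # desired shape: [['This', 'is', ...]]
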