
Python NLTK - Tokenize paragraphs into sentences and words

I have some paragraphs of text in a .txt file. I am trying to tokenize the paragraphs and append them as lists of sentences and words. I'm not sure what I'm doing wrong, because I managed to get the sentences but not the words. I've been banging my head against the wall over this!

Input:

This is sentence one,
Another sentence:
Third line.

Desired output:

[
 ['This', 'is', 'sentence', 'one', ','],
 ['Another', 'sentence', ':'],
 ['Third', 'line', '.']
]

My faulty code and output:

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
sentences_split_into_words = []

with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.append(words_token)

----Result----
    [
     [['This', 'is', 'sentence', 'one', ',']],
     [['Another', 'sentence', ':']],
     [['Third', 'line', '.']]
    ]

I also tried the following, but it returns the error "expected string or bytes-like object":

for line in sentences:
    sentences_split_into_words.append(word_tokenize(line))
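The error happens because `sentences.append(sent_tokenize(line))` appends a whole list for each line, so iterating over `sentences` yields lists, not strings, and `word_tokenize` rejects a list. A minimal sketch of the shape problem, with `sent_tokenize` stubbed out by a hypothetical `fake_sent_tokenize` so it runs without NLTK's punkt data:

```python
def fake_sent_tokenize(line):
    # stand-in for nltk's sent_tokenize: treat each stripped line
    # as a single sentence, returned inside a list (as nltk does)
    return [line.strip()]

lines = ['This is sentence one,\n', 'Another sentence:\n', 'Third line.\n']

sentences = []
for line in lines:
    sentences.append(fake_sent_tokenize(line))

# Each element of `sentences` is itself a list, not a string:
print(sentences[0])        # ['This is sentence one,']
print(type(sentences[0]))  # <class 'list'>
# word_tokenize(sentences[0]) would therefore raise
# "TypeError: expected string or bytes-like object"
```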

Try this code. The fix is to use extend() instead of append() in the second loop, so the per-line token lists are flattened into a single list instead of being nested one level deeper:

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

sentences_split_into_words = []
for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.extend(words_token)

Reference: https://www.programiz.com/python-programming/methods/list/extend
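The append-vs-extend difference can be seen without NLTK at all, using `str.split` as a stand-in tokenizer (note the real `word_tokenize` would also split off punctuation such as ',' and '.'):

```python
# `sentences` as produced by the first loop: one list per input line
sentences = [['This is sentence one,'], ['Another sentence:'], ['Third line.']]

appended, extended = [], []
for line in sentences:
    words_token = [s.split() for s in line]  # stand-in for word_tokenize
    appended.append(words_token)   # adds the whole list -> extra nesting
    extended.extend(words_token)   # adds the items -> desired flat shape

print(appended[0])  # [['This', 'is', 'sentence', 'one,']]
print(extended[0])  # ['This', 'is', 'sentence', 'one,']
```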


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM