Tokenize a paragraph into sentences and then into words in NLTK

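I am trying to feed a whole paragraph into my text processor so that it is split into sentences first and then into words.

I tried the following code, but it does not work: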

    #text is the paragraph input
    sent_text = sent_tokenize(text)
    tokenized_text = word_tokenize(sent_text.split)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)

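However, this does not work and gives me an error. So how do I tokenize a paragraph into sentences and then into words?

A sample paragraph:

This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.

**Warning:** This is just random text from the internet; I do not own the above content.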

You probably intended to loop over sent_text:

import nltk

sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)
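
For context on why the original attempt fails, here is a minimal sketch that runs without NLTK. It assumes `sent_tokenize` returns a plain Python list of sentence strings, and it uses `str.split` purely as a whitespace-based stand-in for `word_tokenize`; the two hard-coded sentences are illustrative only.

```python
# A list of sentence strings, the shape of value that sent_tokenize returns
# (hard-coded here so the sketch does not need NLTK or its data files).
sent_text = ["He sank down in despair.", "He offered a small prayer."]

# Lists have no .split attribute, so evaluating sent_text.split raises
# AttributeError before any word tokenizer is even called.
try:
    sent_text.split
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Each element of the list is a string, so the work has to happen per
# sentence. str.split() is only a stand-in for nltk.word_tokenize here.
per_sentence = [sentence.split() for sentence in sent_text]

print(error_name)
print(per_sentence)
```

Swapping `sentence.split()` for `nltk.word_tokenize(sentence)` gives the same per-sentence loop as the answer above.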

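Here is a shorter version. It gives you a data structure containing each individual sentence and each token within that sentence. I prefer TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your text until after this step, since lowercasing may reduce the accuracy of sentence-boundary detection in messy text.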

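from nltk.tokenize import TweetTokenizer, sent_tokenize

# input_text holds the paragraph to tokenize
tokenizer_words = TweetTokenizer()
# split into sentences first, then tokenize each sentence into words
tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
print(tokens_sentences)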

Here is what the output looks like; I cleaned it up so that the structure stands out:

[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'], 
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'], 
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'], 
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
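
As a rough illustration of how this nested list can be used afterwards, the sketch below counts tokens per sentence, flattens the structure, and lowercases only once sentence splitting is already done. It hard-codes two of the sentences from the output shown above so that it runs without NLTK; the names `tokens_per_sentence`, `flat_tokens`, and `lowered` are made up for this example.

```python
from itertools import chain

# Two sentences copied from the tokenized output shown above.
tokens_sentences = [
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
]

# Number of tokens in each sentence.
tokens_per_sentence = [len(sentence) for sentence in tokens_sentences]

# One flat list of tokens, for code that does not need sentence boundaries.
flat_tokens = list(chain.from_iterable(tokens_sentences))

# Lowercase only now, after the sentences have already been split.
lowered = [[tok.lower() for tok in sentence] for sentence in tokens_sentences]

print(tokens_per_sentence)
print(len(flat_tokens))
```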

import nltk

textsample = "This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."

sentences = nltk.sent_tokenize(textsample)  # list of sentence strings
words = nltk.word_tokenize(textsample)      # flat list of tokens for the whole paragraph
print(sentences)
print([w for w in words if w.isalpha()])    # keep only purely alphabetic tokens

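The last line above keeps only purely alphabetic tokens, so punctuation and other special characters are dropped. Note that this also drops tokens such as 'dark-brown' and the possessive "'s". The sentence output is as follows: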

['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.',
 "He sank down in despair at the child's feet.",
 'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.',
 'At the same time with his ears and his eyes he offered a small prayer to the child.']

The word output after removing special characters is as follows:

['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dog', 'and', 'wounded', 'him', 'to', 'the', 'heart',
 'He', 'sank', 'down', 'in', 'despair', 'at', 'the', 'child', 'feet',
 'When', 'the', 'blow', 'was', 'repeated', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', 'he', 'turned', 'over', 'upon', 'his', 'back', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner',
 'At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child']
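
The `isalpha()` filter is stricter than it may look, which is why 'dark-brown' is missing from the list above. The sketch below compares it with a looser filter that keeps any token containing at least one letter. The token list is hard-coded as an assumption about how `word_tokenize` splits part of the sample text, so the example runs without NLTK; `has_letter` is a name invented for this illustration.

```python
# Tokens as word_tokenize would plausibly split part of the sample text
# (hard-coded as an assumption so this sketch runs without NLTK data).
tokens = ["the", "little", "dark-brown", "dog", ",",
          "the", "child", "'s", "feet", "."]

# str.isalpha() is True only when every character is a letter, so the
# hyphenated word and the possessive marker are dropped with the punctuation.
alpha_only = [w for w in tokens if w.isalpha()]

# A looser filter: keep any token that contains at least one letter.
has_letter = [w for w in tokens if any(ch.isalpha() for ch in w)]

print(alpha_only)
print(has_letter)
```

Which filter is appropriate depends on whether hyphenated words and clitics such as "'s" should count as words for the task at hand.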
