How to tokenize sentences using NLP
I am new to NLP. I am trying to tokenize sentences using NLP on Python 3.7, so I used the following code:
import nltk
text4="This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
x=nltk.sent_tokenize(text4)
x[0]
I expected x[0] to return the first sentence, but instead I got:
Out[4]: 'This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!'
Am I doing something wrong?
Your sentences need valid whitespace and punctuation for the tokenizer to work properly:
import nltk
text4 = "This is a sentence. This is another sentence."
nltk.sent_tokenize(text4)
# ['This is a sentence.', 'This is another sentence.']
# Versus what you had before
nltk.sent_tokenize("This is a sentence.This is another sentence.")
# ['This is a sentence.This is another sentence.']
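If you cannot fix the input by hand, one option is to repair the spacing before tokenizing. Below is a minimal sketch using only the standard re module. The helper name add_missing_spaces and both regular expressions are my own assumptions, not something NLTK provides, and the rules are heuristics: a space is inserted only when sentence punctuation sits between a lowercase letter or digit and an uppercase letter, so "U.S." and "2.99" are left alone.

```python
import re

def add_missing_spaces(text):
    """Heuristically insert the spaces a sentence tokenizer expects (hypothetical helper)."""
    # Space after . ? ! when preceded by a lowercase letter or digit and
    # followed by an uppercase letter: "sentence.A" -> "sentence. A".
    text = re.sub(r'(?<=[a-z0-9])([.?!])(?=[A-Z])', r'\1 ', text)
    # Space after a comma squeezed between two letters: "Yes,it" -> "Yes, it".
    text = re.sub(r'(?<=[A-Za-z]),(?=[A-Za-z])', ', ', text)
    return text

raw = "This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
fixed = add_missing_spaces(raw)
print(fixed)
```

The repaired string can then be passed to nltk.sent_tokenize. Text that breaks these assumptions (for example a sentence that starts with a lowercase letter) would need extra rules.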
NLTK's sent_tokenize does not handle badly formatted text well. It works if you provide proper spacing:
import nltk
nltk.download('punkt')
text4="This is the first sentence. A gallon of milk in the U.S. cost $2.99. Is this the third sentence? Yes, it is"
x=nltk.sent_tokenize(text4)
x[0]
Or you can use this:
import re
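text4 = "This is the first sentence. A gallon of milk in the U.S. cost 2.99. Is this the third sentence? Yes it is"
# Note: this splits on every '.', '?' or '!', so it also breaks inside "U.S." and "2.99",
# and the punctuation itself is dropped from the results.
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text4)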
sentences
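A pattern that splits on every period will also cut inside abbreviations and decimals such as "U.S." and "2.99". As a rough alternative, here is a sketch that splits only where sentence punctuation is followed by whitespace and then a capital letter. The function name split_sentences is hypothetical, and this is a heuristic, not a replacement for a trained tokenizer: it would still split wrongly after an abbreviation followed by a capitalized word (e.g. "U.S. Senate").

```python
import re

def split_sentences(text):
    """Split after . ? ! only when whitespace and a capital letter follow (heuristic)."""
    # (?<=[.?!]) keeps the punctuation attached to its sentence;
    # (?=[A-Z]) avoids splitting "U.S. cost", where a lowercase word follows.
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text.strip())

text4 = "This is the first sentence. A gallon of milk in the U.S. cost $2.99. Is this the third sentence? Yes, it is!"
sentences = split_sentences(text4)
for s in sentences:
    print(s)
```

Because the split happens on the whitespace only, each sentence keeps its closing punctuation.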