How to tokenize a sentence using NLP

I am new to NLP. I am trying to tokenize sentences with NLP on Python 3.7, so I used the following code:

import nltk
text4="This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
x=nltk.sent_tokenize(text4)
x[0]

I expected x[0] to return the first sentence, but I got:

Out[4]: 'This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!'

Am I doing something wrong?

Your sentences need valid spacing and punctuation for the tokenizer to work properly:

import nltk

text4 = "This is a sentence. This is another sentence."
nltk.sent_tokenize(text4)

# ['This is a sentence.', 'This is another sentence.']

## Versus What you had before

nltk.sent_tokenize("This is a sentence.This is another sentence.")

# ['This is a sentence.This is another sentence.']
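
If the input cannot be fixed by hand, the missing spaces could be restored with a regular expression before tokenizing. The following is only a rough sketch: `add_missing_spaces` is a hypothetical helper, not part of NLTK, and it assumes that a new sentence starts with an uppercase letter right after `.`, `?` or `!`, and that the character before the punctuation is a lowercase letter or a digit (so `U.S.` is left alone).

```python
import re

def add_missing_spaces(text):
    """Insert a space after '.', '?' or '!' when a lowercase letter or
    digit precedes the mark and an uppercase letter follows directly.
    Hypothetical helper for illustration; a heuristic, not a full fix."""
    return re.sub(r'(?<=[a-z0-9][.?!])(?=[A-Z])', ' ', text)

raw = "This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
fixed = add_missing_spaces(raw)
print(fixed)
# fixed can then be passed to nltk.sent_tokenize(fixed)
```

The heuristic will miss sentences that start with a lowercase letter or a digit, so it is best treated as a cleanup step for text like the example, not a general solution.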

NLTK's sent_tokenize does not handle malformed text well. It works if you provide proper spacing.

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer data
text4="This is the first sentence. A gallon of milk in the U.S. cost $2.99. Is this the third sentence? Yes, it is"
x=nltk.sent_tokenize(text4)
x[0]

Or you can use this:

import re
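text4 = "This is the first sentence. A gallon of milk in the U.S. cost 2.99. Is this the third sentence? Yes it is"
# Note: this splits at every '.', '?' or '!' and drops the punctuation,
# so 'U.S.' and '2.99' are also broken into separate pieces.
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text4)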
sentences
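
A slightly more conservative regular expression might avoid some of those extra splits. This is a sketch under stated assumptions, not a replacement for a real sentence tokenizer: it splits only on whitespace that follows `.`, `?` or `!` and precedes an uppercase letter, so it keeps the punctuation and leaves `2.99` and `U.S. cost` intact. The name `split_sentences` is made up for this example.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!' and is followed
    by an uppercase letter. Hypothetical helper; heuristic only."""
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)

text4 = "This is the first sentence. A gallon of milk in the U.S. cost 2.99. Is this the third sentence? Yes it is"
sentences = split_sentences(text4)
for s in sentences:
    print(s)
```

It still splits wrongly when an abbreviation is followed by a capitalized word (for example "U.S. Government"), and it needs the spaces to be present, so properly spaced input remains a requirement.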


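Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.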
