How to treat certain words as delimiters in nltk Python?
I am trying to tokenize the text below, using the stop words ('is', 'the', 'was') as delimiters.
The expected output is:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I wrote to try to produce that output:
import nltk
stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    # Drop the stop words, then rejoin the remaining tokens with spaces
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
The output of my code is:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I get the expected output?
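So the problem involves both the stop words and the sentence separators. Assuming a sentence always ends with the '.' symbol, you can use re.split() and put several delimiters into one pattern.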
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
# Raw string so that \. is passed to the regex engine unchanged
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
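Because the text ends with a '.', which is itself a delimiter, the split returns an extra '' at the end. Note also that 'the' survives in the last item: the space in front of it was already consumed by the ' is ' match, so ' the ' can no longer match there. Assuming the sentence structure is consistent, you can slice the result to drop the trailing empty string.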
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
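The sliced result still contains 'the' in the last item, which differs from the expected output in the question. One possible way around this, sketched here as an assumption and not as part of the original answer, is to build the pattern from the stop word list and let a single match swallow a whole run of consecutive stop words. The names text, words, pattern and chunks are illustrative only, and the match is case-sensitive, so a capitalized 'The' or 'Is' would not be treated as a delimiter.

```python
import re

stopwords = ['is', 'the', 'was']
text = ("Walter was feeling anxious. He was diagnosed today. "
        "He probably is the best person I know.")

# Alternation of the stop words, escaped in case one contains regex metacharacters
words = "|".join(re.escape(w) for w in stopwords)

# Either a run of one or more whole-word stop words (with surrounding whitespace),
# or a period with optional whitespace around it
pattern = re.compile(rf"(?:\s*\b(?:{words})\b\s*)+|\s*\.\s*")

# Filter out empty strings, such as the one produced by the final period
chunks = [c for c in pattern.split(text) if c]
print(chunks)
```

Filtering out empty strings, instead of slicing with [:-1], also avoids relying on the text ending with exactly one delimiter.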