How to treat certain words as delimiters in NLTK (Python)?

Question

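I am trying to tokenize the text below, using the stopwords 'is', 'the', and 'was' as delimiters:

Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.

The expected output is: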

['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'best person I know']

This is the code I wrote to try to produce that output:

import nltk 
stopwords = ['is', 'the', 'was']

sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")

sents_rm_stopwords = [] 

for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))

My code produces this instead:

['Walter feeling anxious .',
 'He diagnosed today .', 
 'He probably best person I know .']

How can I get the expected output?

Answer 1

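This problem involves both stopwords and sentence boundaries. Assuming a sentence boundary can be identified by the `.` character, you can pass several alternative delimiters to re.split() and split on all of them at once.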

import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
# Raw string so that \. reaches the regex engine as an escaped literal period
result = re.split(r" was | is | the |\. |\.", s)

result
>>
['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'the best person I know',
 '']

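The string ends with a `.`, which matches the final `\.` alternative, so the split returns an extra '' for the empty text after that last delimiter. Assuming the sentence structure is consistent, you can slice the result to drop it. Note that the last item still begins with 'the': the match for ' is ' consumes the space in front of 'the', so ' the ' can no longer match there.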

result[:-1]
>>
['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'the best person I know']
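
To also handle stopwords that appear back to back (such as "is the"), one possible variation is to treat a whole run of stopwords and periods as a single delimiter. The sketch below is not part of the original answer; the function name `split_on_stopwords` is hypothetical, and it assumes matching is case-sensitive and that every `.` in the text ends a sentence.

```python
import re

STOPWORDS = ["is", "the", "was"]

def split_on_stopwords(text, stopwords=STOPWORDS):
    """Split text on runs of stopwords and periods (hypothetical helper)."""
    # Escape each stopword in case it contains regex metacharacters
    words = "|".join(re.escape(w) for w in stopwords)
    # A delimiter is optional leading whitespace followed by one or more
    # tokens, where a token is a whole-word stopword or a period, each
    # followed by optional whitespace. \b keeps "is" from matching inside
    # longer words such as "this".
    delimiter = re.compile(rf"\s*(?:(?:\b(?:{words})\b|\.)\s*)+")
    # Drop empty strings, e.g. the one produced by the final period
    return [chunk for chunk in delimiter.split(text) if chunk]

s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
print(split_on_stopwords(s))
# ['Walter', 'feeling anxious', 'He', 'diagnosed today', 'He probably', 'best person I know']
```

Filtering out empty chunks replaces the `result[:-1]` slice, so the function does not depend on the text ending with a period.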
