How to treat certain words as delimiters in nltk Python?

I'm trying to tokenize a text, treating the stopwords ('is', 'the', 'was') as delimiters.

The expected output is this:

['Walter', 
 'feeling anxious', 
 'He', 
 'diagnosed today', 
 'He probably', 
 'best person I know']

This is the code I wrote to try to produce the above output:

import nltk 
stopwords = ['is', 'the', 'was']

sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")

sents_rm_stopwords = [] 

for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))

My code's output is this:

['Walter feeling anxious .',
 'He diagnosed today .', 
 'He probably best person I know .']

How can I produce the expected output?

The problem involves two kinds of delimiters: the stopwords and the sentence boundaries. Assuming a sentence always ends with a period (.), you can split on all of them in one pass with re.split().

import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)

result
>>
['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'the best person I know',
 '']

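Because the text ends with a period, which is itself a delimiter, re.split() returns an extra empty string '' after the final match. Note also that 'the' survives in the last item: in "is the", the pattern " is " consumes the space that " the " would need in order to match. Assuming the text always ends with a period, you can slice the result to drop the trailing empty string.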

result[:-1]
>>
['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'the best person I know']
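
To also handle consecutive stopwords such as "is the", one option is to build the pattern from the stopword list and let it match a run of one or more stopwords. The following is a minimal sketch, not a definitive solution: the names stop_alt, pattern and chunks are my own, and it assumes stopwords are lowercase, appear mid-sentence (preceded by whitespace), and that every period ends a sentence (so abbreviations like "Dr." would be split too).

```python
import re

stopwords = ["is", "the", "was"]
text = ("Walter was feeling anxious. He was diagnosed today. "
        "He probably is the best person I know.")

# Alternation of the stopwords, escaped in case one contains regex metacharacters.
stop_alt = "|".join(re.escape(w) for w in stopwords)

# Delimiter is either:
#   - whitespace followed by one or more whole-word stopwords, or
#   - a period followed by optional whitespace.
pattern = re.compile(rf"\s+(?:(?:{stop_alt})\b\s*)+|\.\s*")

# Filtering out empty strings removes the '' left after the final period,
# so no slicing is needed.
chunks = [c for c in pattern.split(text) if c]
print(chunks)
```

Because the pattern requires whitespace before a stopword and a word boundary after it, words that merely contain a stopword (for example "this" or "island") are left alone. A stopword at the very start of a sentence is not removed by this sketch.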
