
Extract words from a string to create a feature set (NLTK)

I am using NLTK and NLTK-Trainer to do some sentiment analysis. I have pickled an accurate classifier. When I follow the instructions provided by NLTK-Trainer, everything works well.

Here is what works (it returns the desired output):

>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words])
>>> classifier.classify(feats)

'feats' looks like this:

Out[52]: {'a': True, 'in': True, 'sentence': True, 'some': True, 'words': True}

However, I don't want to type in words separated by commas and apostrophes each time. I have a script that does some preprocessing on the text and returns a string that looks like this:

"[['words'], ['in'], ['a'], ['sentence']]"

However, when I try to define 'feats' from that string, I am left with something that looks like this:

{' ': True,
 "'": True,
 ',': True,
 '[': True,
 ']': True,
 'a': True,
 'b': True,
 'c': True,
 'e': True,
 'h': True,
 'i': True,
 'l': True,
 'n': True,
 'o': True,
 'p': True,
 'r': True,
 's': True,
 'u': True}

Obviously the classifier isn't very effective with this input. It appears that the 'feats' definition is extracting individual letters from the text string instead of whole words. How do I fix this?
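The single-letter keys come from a basic Python behavior: iterating over a string yields its individual characters, not its words. A minimal sketch of what is happening (using a shortened version of the string above):

```python
# Iterating over a string produces one character at a time,
# so the dict comprehension builds single-character keys.
text = "[['words'], ['in']]"
feats = dict([(ch, True) for ch in text])

print('w' in feats)       # individual characters become keys
print('words' in feats)   # whole words do not
```

So the string must first be split into a list of words before building the feature dictionary.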

I am not sure I understand, but I would suggest:

import nltk

words = nltk.word_tokenize("some words in a sentence")
feats = {word: True for word in words}
classifier.classify(feats)  # classifier is your unpickled classifier

If you want to use your pre-processed text, try:

text = "[['words'], ['in'], ['a'], ['sentence']]"
# Strip the leading "[['" and trailing "']]", then split on the
# "'], ['" separator between words.
words = text[3:len(text) - 3].split("'], ['")
feats = {word: True for word in words}
classifier.classify(feats)
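As a more robust alternative, assuming your pre-processed string is always valid Python literal syntax (as the example suggests), you could parse it with `ast.literal_eval` instead of relying on slicing offsets, then flatten the nested lists:

```python
import ast

text = "[['words'], ['in'], ['a'], ['sentence']]"

# Safely parse the string into a list of single-element lists,
# then flatten it into a plain list of words.
nested = ast.literal_eval(text)
words = [word for sub in nested for word in sub]

feats = {word: True for word in words}
```

This keeps working even if a word contains characters that would confuse the manual `split`, and the resulting `feats` can be passed to `classifier.classify(feats)` as before.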
