import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # assumed here: the standard English stop word list

def preprocess_text(text):
    tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+')
    cleaned_tokens = [word.lower() for word in tokenized_document if word.lower() not in stop_words]
    stemmed_text = [nltk.stem.PorterStemmer().stem(word) for word in cleaned_tokens]
    return stemmed_text

data["Text"] = data["Text"].apply(preprocess_text)
data.head()
Error message:
TypeError: 'RegexpTokenizer' object is not iterable
Your tokenized_document object is an instance of nltk.tokenize.RegexpTokenizer. You are trying to iterate over the values of tokenized_document (in the for word in tokenized_document expression), but nltk.tokenize.RegexpTokenizer doesn't support that usage. That's what the 'RegexpTokenizer' object is not iterable message is telling you.

The source of the problem is that you have not called the tokenize method, and haven't used the text parameter at all.
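To see the distinction, here is a minimal sketch (the sample sentence is made up for illustration): the tokenizer object itself cannot be iterated, but its tokenize method returns an ordinary list of strings that can.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer('[a-zA-Z0-9\']+')
# for word in tokenizer: ...  would raise the TypeError shown above
tokens = tokenizer.tokenize("Don't panic, it's fine")
print(tokens)  # ["Don't", 'panic', "it's", 'fine']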
Fix: call .tokenize(text):

tokenized_document = nltk.tokenize.RegexpTokenizer('[a-zA-Z0-9\']+').tokenize(text)
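Putting it together, here is a minimal sketch of the corrected function. It assumes stop_words is the standard English list from nltk.corpus.stopwords; the tokenizer and stemmer are built once at module level rather than on every call, which avoids re-creating them for each row of the DataFrame.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer('[a-zA-Z0-9\']+')  # words, digits, apostrophes
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))   # assumption: English stop words

def preprocess_text(text):
    tokens = tokenizer.tokenize(text)  # now a list of strings, so iteration works
    cleaned_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    return [stemmer.stem(word) for word in cleaned_tokens]

data["Text"] = data["Text"].apply(preprocess_text)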