I'm trying to use both counts and tfidf as features for a multinomial NB model. Here's my code:
text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).fit(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
But I'm getting an error with FeatureUnion and tfidf:
TypeError: no supported conversion for types: (dtype('S18413'),)
Any idea why this could be happening? Is it not possible to have both counts and tfidf as features?
The error didn't come from the FeatureUnion; it came from the TfidfTransformer. Use TfidfVectorizer instead of TfidfTransformer: the transformer expects a matrix of term counts (such as the output of CountVectorizer) as input, not plain text, hence the TypeError.
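To make the difference concrete, here is a minimal sketch on a made-up four-document corpus (the `docs` list and variable names are illustrative, not from the question). It assumes default settings for all three classes and shows that TfidfTransformer works once it is fed counts, and that TfidfVectorizer gives the same result directly from raw text:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only.
docs = ["free money now", "meeting at noon", "free lunch at noon", "win money now"]

# Step 1: raw text -> sparse matrix of term counts.
counts = CountVectorizer().fit_transform(docs)

# Step 2: counts -> tf-idf weights. The transformer takes the count matrix, not strings.
tfidf_two_step = TfidfTransformer(use_idf=True).fit_transform(counts)

# TfidfVectorizer performs both steps in one call, starting from raw text.
tfidf_one_step = TfidfVectorizer(use_idf=True).fit_transform(docs)

# With default settings the two routes should produce the same matrix.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

Inside a FeatureUnion every branch receives the same raw input, which is why the branch that computes tf-idf has to be a vectorizer rather than a bare transformer.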
Also, your test corpus is too small for testing tf-idf: with min_df=3 a term must appear in at least three documents, which two sentences cannot satisfy. Try a bigger one. Here's an example:
from nltk.corpus import brown  # requires the corpus: nltk.download("brown")
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB
# Let's get more text from NLTK: the first 100 sentences of the Brown corpus.
text = [" ".join(i) for i in brown.sents()[:100]]
# I'm just going to assign arbitrary labels.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer takes raw text, unlike TfidfTransformer.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# Both vectorizers see the raw text; their outputs are stacked side by side.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
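If you also want to classify new text later, one option is to keep the fitted FeatureUnion around by wrapping it and the classifier in a Pipeline. The following is only a sketch under assumptions: the toy corpus, the "spam"/"ham" labels and the step names are invented for illustration, and the vectorizers use default settings rather than the min_df and stop_words values above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy corpus and labels, made up for illustration only.
train_text = [
    "win free money now",
    "free prize claim now",
    "cheap money offer",
    "claim your free offer",
    "meeting moved to noon",
    "lunch at noon today",
    "project meeting today",
    "notes from the meeting",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

# The pipeline fits both vectorizers and the classifier in one call,
# and reuses the fitted vectorizers when predicting on unseen text.
model = Pipeline([
    ("features", FeatureUnion([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfVectorizer(use_idf=True)),
    ])),
    ("clf", MultinomialNB()),
])
model.fit(train_text, train_labels)

# Count columns and tf-idf columns are concatenated side by side.
features = model.named_steps["features"].transform(train_text)
print(features.shape)

prediction = model.predict(["free money offer"])
print(prediction)
```

The design choice here is simply that new documents go through the same fitted vocabulary as the training data, which the bare fit_transform call does not give you because the FeatureUnion object is discarded.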