How to use the imbalanced library with sklearn pipeline?

Question

I am trying to solve a text classification problem and want to create a baseline model using MultinomialNB.

My data is highly imbalanced for a few categories, so I decided to use the imbalanced-learn library with an sklearn pipeline, following the tutorial.

The model fails with an error after I add the two resampling stages to the pipeline as suggested in the docs.

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
                                         ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
                         ('tfidf', TfidfTransformer(use_idf= True)),\
                          ('enn', EditedNearestNeighbours()),\
                          ('renn', RepeatedEditedNearestNeighbours()),\
                          ('clf-gnb',  MultinomialNB()),])

Error:

TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

Can someone please help here? I am also open to a different approach (boosting/SMOTE).

Answer 1

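It seems that make_pipeline from imblearn doesn't accept a list of named (name, estimator) steps the way sklearn's Pipeline class does; it takes the estimators themselves as positional arguments and names the steps automatically. From the imblearn documentation: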

*steps : list of estimators.

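You should modify your code to:

from imblearn.pipeline import make_pipeline as make_pipeline_imb

# Pass the estimators directly; make_pipeline names the steps automatically.
pipe = make_pipeline_imb(
    CountVectorizer(max_features=100000, ngram_range=(1, 2),
                    tokenizer=tokenize_and_stem),
    TfidfTransformer(use_idf=True),
    EditedNearestNeighbours(),
    RepeatedEditedNearestNeighbours(),
    MultinomialNB(),
)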
