Machine Learning - How to extract features from pipeline

Question

I am totally new to the field and currently I am stuck. Here is what I want and what I did:

I have a DataFrame that is split into train and test datasets. The training features are Twitter messages, and the labels are assigned categories. I set up a tokenizer (called clean_text) that keeps only relevant words and strips the messages down to the core information. The model, including a grid search, looks as follows:

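from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

def build_model():
    # clean_text is my own tokenizer function (defined elsewhere)
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=clean_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier())),
    ])

    # parameters to grid search
    parameters = {
        'vectorizer__max_features': [50],  # , 72, 144, 288, 576, 1152],
        'clf__estimator__n_estimators': [100],  # , 100]
    }

    # initiating GridSearchCV method
    model = GridSearchCV(pipeline, param_grid=parameters, cv=5)

    return model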

The fitting works fine, as does the evaluation. Now I am not sure whether the model is set up correctly, and whether the features are the most used tokens in the messages (in the above case 50) or whether there is an error.

Now the question: Is there a way to print the 50 features and see if they look right?

Best,
Felix

Answer 1

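Without sample data, this is my best guess. Please check whether the following works after fitting the model; if you can share sample data, we can help you better. Note that vocabulary_ maps each kept token to its column index, and that the fitted vectorizer has to be taken from the fitted pipeline inside the grid search:

print(model.best_estimator_.named_steps['vectorizer'].vocabulary_)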
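
To illustrate what vocabulary_ contains and how max_features limits it, here is a minimal sketch on a made-up three-message corpus. The messages and the limit of 3 are assumptions for illustration only, and the default tokenizer stands in for clean_text. According to the scikit-learn documentation, max_features keeps the top tokens ordered by term frequency across the corpus, which is what the printout lets you check.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the asker's data).
docs = [
    "flood water rising need help",
    "need water and food",
    "water supply cut need help now",
]

# Keep only the 3 most frequent tokens across the whole corpus.
vectorizer = CountVectorizer(max_features=3)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept token to its column index in the count matrix.
vocab = vectorizer.vocabulary_

# Summing each column gives the corpus-wide count of every kept token.
totals = np.asarray(counts.sum(axis=0)).ravel()

# Print the kept tokens, most frequent first.
for token, idx in sorted(vocab.items(), key=lambda kv: -totals[kv[1]]):
    print(token, int(totals[idx]))
```

If the printed tokens are not the ones you expect, the tokenizer is the first place to look, since it decides what counts as a token before the frequency cut is applied.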

Answer 2

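This should work after model.fit(...); otherwise, please share a sample DataFrame. Use best_estimator_ rather than estimator, because estimator is the unfitted pipeline that was passed to GridSearchCV. On scikit-learn versions before 1.0, call get_feature_names() instead of get_feature_names_out().

model.best_estimator_.named_steps['vectorizer'].get_feature_names_out()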
