
Machine Learning - How to extract features from pipeline

I am totally new to the field and currently stuck. Here is what I want and what I did:

I have a DataFrame that is split into a train and a test dataset. The training features are Twitter messages; the labels are assigned categories. I set up a tokenizer (called clean_text) that keeps only relevant words and strips the messages down to the core information. The model, including a grid search, looks as follows:

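from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

def build_model():
    # clean_text is the custom tokenizer described above
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=clean_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier())),
    ])

    # parameters to grid search
    parameters = {
        'vectorizer__max_features': [50],  # , 72, 144, 288, 576, 1152],
        'clf__estimator__n_estimators': [100],
    }

    # initiating GridSearchCV method
    model = GridSearchCV(pipeline, param_grid=parameters, cv=5)

    return model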

The fitting works fine, as does the evaluation. Now I am not sure whether the model is set up correctly and whether the features really are the most-used tokens in the messages (50 of them in the case above), or whether there is an error.

Now the question: Is there a way to print the 50 features and see if they look right?

Best Felix

Without sample data, this is my best guess. Please check whether the following works. If you can share sample data, we can help you better.

print(model.best_estimator_.named_steps['vectorizer'].vocabulary_)

This should work; if it does not, please share a sample dataframe.
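To make the vocabulary_ suggestion concrete, here is a minimal sketch on a bare CountVectorizer, outside any pipeline. The three example messages are made up, and the default tokenizer is used instead of clean_text, so treat it as an illustration of what max_features keeps rather than a reproduction of the setup in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages (an assumption, not the asker's data).
docs = [
    "flood water rising flood",
    "need water and food",
    "flood damage, need help",
]

# max_features keeps only the terms with the highest total count in the corpus.
vectorizer = CountVectorizer(max_features=3)
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the count matrix.
vocabulary = vectorizer.vocabulary_
kept_terms = sorted(vocabulary, key=vocabulary.get)

# Total occurrences of each kept term, in column order.
totals = np.asarray(counts.sum(axis=0)).ravel().tolist()

for term, total in zip(kept_terms, totals):
    print(term, total)
```

If the printed terms are the ones you would expect to be most frequent, max_features is doing what the question assumes.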

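# model.estimator is the unfitted pipeline; the fitted one is model.best_estimator_
# (on scikit-learn < 1.0, call get_feature_names() instead)
model.best_estimator_.named_steps['vectorizer'].get_feature_names_out()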
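Putting this together with the pipeline from the question, a hedged end-to-end sketch might look like the following. The clean_text function here is a hypothetical stand-in (lowercase and whitespace split), the messages and labels are made up, and cv, max_features and n_estimators are shrunk so the toy example runs quickly. It assumes GridSearchCV is left at its default refit=True, so that best_estimator_ exists after fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def clean_text(text):
    # Hypothetical stand-in for the asker's tokenizer.
    return text.lower().split()


# Made-up toy data: six messages, two binary label columns.
X_train = [
    "need water now",
    "water supply broken",
    "send medical help",
    "medical team needed",
    "need water and medical help",
    "road is blocked",
]
y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(tokenizer=clean_text)),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])
parameters = {
    "vectorizer__max_features": [4],
    "clf__estimator__n_estimators": [10],
}

model = GridSearchCV(pipeline, param_grid=parameters, cv=2)
model.fit(X_train, y_train)

# The pipeline refitted on all training data lives in best_estimator_.
fitted_vectorizer = model.best_estimator_.named_steps["vectorizer"]
if hasattr(fitted_vectorizer, "get_feature_names_out"):
    feature_names = list(fitted_vectorizer.get_feature_names_out())
else:  # scikit-learn < 1.0
    feature_names = list(fitted_vectorizer.get_feature_names())

print(feature_names)
```

The same lookup should work on the model returned by build_model() once it has been fitted; with max_features set to 50 the list would hold 50 terms, provided the corpus contains at least that many distinct tokens.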
