
Extract text features from dataframe

I have a dataframe with two text fields and some other features, in this format:

 message               feature_1   feature_2   score   text
 'This is the text'    4           7           10      extra text
 'This is more text'   3           2           8       and this is another text

My goal is to predict the score. To turn this dataframe into a feature matrix that I can feed into my machine learning model, here is what I did:

    # Create vectorizer for function to use
    vectorizer = TfidfVectorizer()
    # combine the numerical features with the TFIDF generated matrix
    X = sp.sparse.hstack( (vectorizer.fit_transform(df.message),
                      df[['feature_1', 'feature_2']].values, vectorizer.fit_transform(df.text)),
                      format='csr')

When I print the shape of my X matrix I get 2x13, but when I check X_columns like this:

X_columns = vectorizer.get_feature_names() + df[['feature_1', 'feature_2']].columns.tolist()

I don't get all the words in the corpus. It only gives me the words that appear in df.text plus the other feature names, without the words from df.message.

['and', 'another', 'extra', 'is', 'text', 'this', 'feature_1', 'feature_2']

How can I make X contain all of my dataframe's features?

As a general rule, fit your vectorizer on the entire corpus of texts to calculate the vocabulary and then transform all text to vectors afterwards.
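To see why this matters, here is a minimal sketch using plain lists as stand-ins for the two text columns in the question (the variable names are mine, not from the original code). It inspects the fitted `vocabulary_` attribute after each `fit_transform` call on the same vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for df.message and df.text from the question
messages = ["This is the text", "This is more text"]
texts = ["extra text", "and this is another text"]

vectorizer = TfidfVectorizer()

# First fit: the vocabulary is learned from the messages only
vectorizer.fit_transform(messages)
vocab_after_first = sorted(vectorizer.vocabulary_)

# Second fit on the same object: the vocabulary is relearned from scratch
vectorizer.fit_transform(texts)
vocab_after_second = sorted(vectorizer.vocabulary_)

print(vocab_after_first)   # ['is', 'more', 'text', 'the', 'this']
print(vocab_after_second)  # ['and', 'another', 'extra', 'is', 'text', 'this']
```

After the second call, 'more' and 'the' are gone, which matches the column list shown in the question.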

You are fitting the vectorizer twice, so the second call to fit_transform refits it and replaces the vocabulary learned from df.message with the one learned from df.text. Instead, fit once on both text fields to build the vocabulary over the whole corpus, and then transform each text field separately, like this:

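import pandas as pd
import scipy as sp
import scipy.sparse  # make sure sp.sparse is loaded
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Fit once on both text columns so the vocabulary covers the whole corpus
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
vectorizer.fit(pd.concat([df.message, df.text]))
# Transform each text column with the shared vocabulary and stack with the numeric features
X = sp.sparse.hstack( (vectorizer.transform(df.message),
                 df[['feature_1', 'feature_2']].values, vectorizer.transform(df.text)),
                 format='csr')

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_columns = vectorizer.get_feature_names_out().tolist() + df[['feature_1', 'feature_2']].columns.tolist()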

This gives me:

X_columns
Out[51]: ['and', 'another', 'extra', 'is', 'more', 'text', 'the', 'this', 'feature_1', 'feature_2']

Is that what you're after?
