How to remove some columns from a count-vectorized sparse DataFrame in pandas

I have around 2000 text features in a count-vectorized data frame. I also have a list of 800 text feature columns that actually contribute to my prediction model's feature importance. I want to keep only these 800 columns and remove the remaining 1200 columns, since they contribute little to the prediction.

How can I do that? The list of columns to keep is stored in a text file.

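import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=2000, analyzer='word')
cv_text = cv.fit_transform(data.pop('text'))
# Add one sparse column per vocabulary term
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_text[:, i].toarray().ravel(), fill_value=0)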

It should be easy:

data = data.drop(list_of_cols_to_drop, axis=1)

or

data = data.drop(data.columns.difference(list_of_needed_cols), axis=1)

SparseDataFrame objects have a drop method as well.

From the docstring:

In [139]: pd.SparseDataFrame.drop?
Signature: pd.SparseDataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Docstring:
Return new object with labels in requested axis removed.
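
To make the second variant concrete, here is a minimal sketch on toy data. Several things in it are assumptions rather than part of the question: the column names, the "label" column, and the keep-list file format (one column name per line, simulated with io.StringIO). It also builds the sparse columns with pd.arrays.SparseArray, because pd.SparseSeries and pd.SparseDataFrame were removed in pandas 1.0; on those newer versions the ordinary DataFrame.drop handles sparse columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for the count-vectorized frame: one ordinary column
# plus a few sparse count columns.
data = pd.DataFrame({
    "label": [0, 1, 0],
    "apple": pd.arrays.SparseArray([1, 0, 2], fill_value=0),
    "banana": pd.arrays.SparseArray([0, 3, 0], fill_value=0),
    "cherry": pd.arrays.SparseArray([4, 0, 0], fill_value=0),
})

# Assumed file format: one column name per line. StringIO stands in for an
# opened text file so the sketch runs without a real file on disk.
keep_file = io.StringIO("apple\ncherry\n")
text_cols_to_keep = [line.strip() for line in keep_file if line.strip()]

# Non-text columns (here "label") are not in the file, so keep them explicitly.
list_of_needed_cols = ["label"] + text_cols_to_keep

# Drop every column that is not in the keep-list.
reduced = data.drop(data.columns.difference(list_of_needed_cols), axis=1)
print(reduced.columns.tolist())
```

Note that columns.difference returns everything not in the keep-list, so any non-text column that should survive has to be included in list_of_needed_cols.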

