Group features of TF-IDF vector in scikit-learn

I'm using scikit-learn to train a text classification model on TF-IDF feature vectors with the following code:

from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer

model = naive_bayes.MultinomialNB()
feature_vector_train = TfidfVectorizer().fit_transform(X)
model.fit(feature_vector_train, Y)

I need to rank the extracted features in decreasing order of their TF-IDF weight, group them into two non-overlapping sets of features, and finally train two different classification models. How can I split the main feature vector into an odd-ranked set and an even-ranked set?

The result of your TfidfVectorizer is an n x m matrix, where n is the number of documents and m is the number of unique words. Thus, each column in feature_vector_train corresponds to a specific word from your dataset. Adapting a solution from this tutorial should allow you to extract the highest and lowest weighted words:

import numpy as np

vectorizer = TfidfVectorizer()
feature_vector_train = vectorizer.fit_transform(X)
feature_names = vectorizer.get_feature_names_out() #use get_feature_names() on scikit-learn versions older than 1.0

#this assumes you only want a straight sum of each feature's weight across all documents
#sum(axis=0) returns a 1 x m matrix, so flatten it to a 1-D array before zipping
total_tfidf_weights = np.asarray(feature_vector_train.sum(axis=0)).ravel()
#alternatively, you could use vectorizer.transform(feature_names) to get the values of each feature in isolation

#sort the feature names and the tfidf weights together by zipping them
#the key argument sorts by element 1 of each pair (the weight); reverse=True sorts from largest to smallest
sorted_names_weights = sorted(zip(feature_names, total_tfidf_weights), key=lambda x: x[1], reverse=True)
#unzip the names and weights
sorted_features_names, sorted_total_tfidf_weights = zip(*sorted_names_weights)

From this point you should be able to separate the features as you'd like. Once you have split them into two groups, group1 and group2, you can separate them into two matrices like this:

#create a feature_name to column index mapping
column_mapping = {name: i for i, name in enumerate(feature_names)}

#get the submatrices
group1_column_indexes = [column_mapping[feat] for feat in group1]
group1_feature_vector_train  = feature_vector_train[:,group1_column_indexes] #all rows, but only group1 columns

group2_column_indexes = [column_mapping[feat] for feat in group2]
group2_feature_vector_train  = feature_vector_train[:,group2_column_indexes]
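
For the odd/even split the question asks about, the steps above can be combined into one short sketch. It ranks columns directly with np.argsort instead of going through feature names, which is a swapped-in shortcut rather than the approach shown above. The toy corpus, the labels, and the names ranked_columns, odd_ranked_columns, even_ranked_columns, odd_model and even_model are hypothetical and used for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
X = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock prices rose sharply today",
    "the market fell as investors sold shares",
]
Y = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
feature_vector_train = vectorizer.fit_transform(X)

# Summed TF-IDF weight per column, flattened to a 1-D array.
total_tfidf_weights = np.asarray(feature_vector_train.sum(axis=0)).ravel()

# Column indices ordered from highest to lowest total weight.
ranked_columns = np.argsort(total_tfidf_weights)[::-1]

# Ranks 1, 3, 5, ... go to one set; ranks 2, 4, 6, ... go to the other.
odd_ranked_columns = ranked_columns[0::2]
even_ranked_columns = ranked_columns[1::2]

# All rows, but only the columns belonging to each set.
odd_feature_vector_train = feature_vector_train[:, odd_ranked_columns]
even_feature_vector_train = feature_vector_train[:, even_ranked_columns]

# One classifier per feature set.
odd_model = MultinomialNB().fit(odd_feature_vector_train, Y)
even_model = MultinomialNB().fit(even_feature_vector_train, Y)
```

At prediction time, new documents would need the same treatment: vectorizer.transform(new_docs) followed by the same column selection for each model. Features with equal total weight are ranked in an arbitrary order here, so if ties matter, a stable sort with an explicit tie-breaker is probably a better choice.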
