How to add a threshold limit to TF-IDF values in a sparse matrix

I am using TfidfTransformer from sklearn.feature_extraction.text to get the TF-IDF values for my corpus.

This is what my code looks like:

    import json
    from collections import OrderedDict

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    X = dataset[:, 0]
    Y = dataset[:, 1]

    # Re-serialise each JSON document as a compact string
    for index, item in enumerate(X):
        reqJson = json.loads(item, object_pairs_hook=OrderedDict)
        X[index] = json.dumps(reqJson, separators=(',', ':'))

    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)

    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    # (58720, 167216) is the size of my sparse matrix
    for i in range(0, 58720):
        for j in range(0, 167216):
            print(i, j)
            if X_train_tfidf[i, j] > 0.35:
                X_train_tfidf[i, j] = 0

As you can see, I want to filter out tf-idf values greater than 0.35 so that I can reduce my feature set and make my model more time-efficient, but using a for loop just makes things worse. I have looked through the documentation of TfidfTransformer but cannot find a better way. Any ideas or tips? Thank you.

It sounds like the goal here is to ignore frequent words.

The TfidfVectorizer (not TfidfTransformer) implementation includes a max_df parameter for this. From the documentation:

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).

In the following example, word1 and word3 each occur in more than 50% of the documents (3 of 4), while word2 occurs in exactly 50% (2 of 4), so setting max_df=0.5 leaves only word2 in the resulting array:

    from sklearn.feature_extraction.text import TfidfVectorizer

    raw_data = [
        "word1 word2 word3",
        "word1 word1 word1",
        "word2 word2 word3",
        "word1 word1 word3",
    ]

    vect = TfidfVectorizer(max_df=0.5)
    X = vect.fit_transform(raw_data)

    print(vect.get_feature_names_out())
    print(X.todense())

Output:

    ['word2']
    [[1.]
     [0.]
     [1.]
     [0.]]
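
If the per-value cut-off from the question is still wanted, the nested loop can probably be avoided by working on the stored entries of the sparse matrix directly. The following is only a minimal sketch, assuming the matrix is a SciPy CSR matrix (which is what fit_transform returns by default); the small matrix and the names kept_columns and X_reduced are made up for illustration and are not from the question's data. It zeroes values above 0.35, as the loop in the question does, and then optionally drops columns that are left empty.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for X_train_tfidf; the values are invented for this sketch.
X_train_tfidf = csr_matrix(np.array([
    [0.0, 0.2, 0.9],
    [0.5, 0.0, 0.1],
    [0.0, 0.3, 0.0],
]))

threshold = 0.35  # cut-off used in the question

# .data holds only the stored (non-zero) entries, so this touches
# nnz values rather than rows * columns cells.
X_train_tfidf.data[X_train_tfidf.data > threshold] = 0
X_train_tfidf.eliminate_zeros()  # remove the explicit zeros from storage

# Zeroing values alone does not shrink the feature set; dropping columns
# that no longer hold any stored value does.
kept_columns = np.flatnonzero(X_train_tfidf.getnnz(axis=0) > 0)
X_reduced = X_train_tfidf[:, kept_columns]

print(X_train_tfidf.toarray())
print(kept_columns, X_reduced.shape)
```

Note that this filters individual tf-idf weights, whereas max_df filters whole terms by document frequency, so the two approaches are not interchangeable.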
