
Persist TF-IDF data

I want to store the TF-IDF matrix so that I don't have to recalculate it all the time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters.
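Roughly, my setup looks like this (simplified; the corpus and the number of clusters are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Placeholder corpus; in practice these are the stored documents.
    corpus = [
        "the first document",
        "a second document about clustering",
        "a third document about recommendations",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix (CSR)

    # KMeans accepts the sparse matrix directly; n_clusters is a placeholder.
    kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
    print(kmeans.labels_)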

Pickling (especially with joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
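For example, a minimal dump-and-load round trip with joblib (the file names are arbitrary):

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(["the first document", "a second document"])

    # Dump both the fitted vectorizer and the TF-IDF matrix to disk.
    joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
    joblib.dump(X, "tfidf_matrix.joblib")

    # ... later, under the same scikit-learn version:
    vectorizer = joblib.load("tfidf_vectorizer.joblib")
    X = joblib.load("tfidf_matrix.joblib")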

However, the pickle format depends on the class definitions of the models, which might change from one version of scikit-learn to another.

If you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn, I would recommend writing your own implementation-independent persistence scheme.
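One possible sketch of such a scheme: persist only the learned parameters, i.e. the vocabulary and the IDF weights, in plain formats such as JSON and .npy. The file names below are placeholders, and rebuilding the vectorizer through the idf_ setter assumes a scikit-learn version that exposes it; any non-default constructor arguments (tokenizer, ngram_range, ...) would also need to be repeated when rebuilding:

    import json
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    vectorizer.fit(["the first document", "a second document"])

    # Save the learned parameters in plain, version-independent formats.
    with open("vocabulary.json", "w") as f:
        json.dump({term: int(idx) for term, idx in vectorizer.vocabulary_.items()}, f)
    np.save("idf.npy", vectorizer.idf_)

    # Rebuild a vectorizer from the saved parameters (assumes a
    # scikit-learn version where TfidfVectorizer exposes an idf_ setter).
    with open("vocabulary.json") as f:
        vocabulary = json.load(f)
    restored = TfidfVectorizer(vocabulary=vocabulary)
    restored.idf_ = np.load("idf.npy")
    X_new = restored.transform(["a new document"])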

I would also recommend the HDF5 file format (used by PyTables, for instance) or other database systems that have some kind of support for storing numerical arrays efficiently.
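For instance, the CSR representation of the TF-IDF matrix can be written to HDF5 as its three underlying numerical arrays; this sketch assumes the h5py package, and the file and dataset names are arbitrary:

    import h5py
    from scipy import sparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    X = TfidfVectorizer().fit_transform(["the first document", "a second document"])

    # Write the three CSR component arrays plus the matrix shape.
    with h5py.File("tfidf.h5", "w") as f:
        f.create_dataset("data", data=X.data)
        f.create_dataset("indices", data=X.indices)
        f.create_dataset("indptr", data=X.indptr)
        f.attrs["shape"] = X.shape

    # Read them back and rebuild the sparse matrix.
    with h5py.File("tfidf.h5", "r") as f:
        X_restored = sparse.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]),
        )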

Also have a look at the internal CSR and COO data structures that scipy.sparse uses to represent sparse matrices, so you can come up with an efficient way to store them in a database.
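As a sketch of that idea, the COO layout maps directly onto a relational table of (row, column, value) triples; sqlite3 stands in below for whatever database you use, and the table and column names are placeholders:

    import sqlite3
    from scipy import sparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    X = TfidfVectorizer().fit_transform(["the first document", "a second document"])
    coo = X.tocoo()

    # Store one (row index, column index, value) triple per nonzero entry.
    conn = sqlite3.connect("tfidf.db")
    conn.execute("CREATE TABLE IF NOT EXISTS tfidf (i INTEGER, j INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO tfidf VALUES (?, ?, ?)",
        zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()),
    )
    conn.commit()

    # Rebuild the matrix from the stored triples (in practice the shape
    # would also need to be stored; here it is reused from X).
    rows, cols, vals = zip(*conn.execute("SELECT i, j, v FROM tfidf"))
    X_restored = sparse.coo_matrix((vals, (rows, cols)), shape=X.shape).tocsr()
    conn.close()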
