
Persist TF-IDF data

I want to store the TF-IDF matrix so that I don't have to recalculate it all the time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters.
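Roughly, my setup looks like this (simplified; the corpus and the number of clusters are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Placeholder corpus; in practice these are the stored documents.
    corpus = [
        "the first document",
        "a second document about clustering",
        "a third document about recommendations",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix (CSR)

    # KMeans accepts the sparse matrix directly; n_clusters is a placeholder.
    kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
    print(kmeans.labels_)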

Pickling (especially with joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
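For example, a minimal dump-and-load round trip with joblib (the file names are arbitrary):

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(["the first document", "a second document"])

    # Dump both the fitted vectorizer and the TF-IDF matrix to disk.
    joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
    joblib.dump(X, "tfidf_matrix.joblib")

    # ... later, under the same scikit-learn version:
    vectorizer = joblib.load("tfidf_vectorizer.joblib")
    X = joblib.load("tfidf_matrix.joblib")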

However, the pickle format depends on the class definitions of the models, which might change from one version of scikit-learn to another.

If you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn, I would recommend writing your own implementation-independent persistence scheme.
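One possible sketch of such a scheme: persist only the learned parameters, i.e. the vocabulary and the IDF weights, in plain formats such as JSON and .npy. The file names below are placeholders, and rebuilding the vectorizer through the idf_ setter assumes a scikit-learn version that exposes it; any non-default constructor arguments (tokenizer, ngram_range, ...) would also need to be repeated when rebuilding:

    import json
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    vectorizer.fit(["the first document", "a second document"])

    # Save the learned parameters in plain, version-independent formats.
    with open("vocabulary.json", "w") as f:
        json.dump({term: int(idx) for term, idx in vectorizer.vocabulary_.items()}, f)
    np.save("idf.npy", vectorizer.idf_)

    # Rebuild a vectorizer from the saved parameters (assumes a
    # scikit-learn version where TfidfVectorizer exposes an idf_ setter).
    with open("vocabulary.json") as f:
        vocabulary = json.load(f)
    restored = TfidfVectorizer(vocabulary=vocabulary)
    restored.idf_ = np.load("idf.npy")
    X_new = restored.transform(["a new document"])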

I would also recommend the HDF5 file format (used by PyTables, for instance) or other database systems that have some kind of support for storing numerical arrays efficiently.
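For instance, the CSR representation of the TF-IDF matrix can be written to HDF5 as its three underlying numerical arrays; this sketch assumes the h5py package, and the file and dataset names are arbitrary:

    import h5py
    from scipy import sparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    X = TfidfVectorizer().fit_transform(["the first document", "a second document"])

    # Write the three CSR component arrays plus the matrix shape.
    with h5py.File("tfidf.h5", "w") as f:
        f.create_dataset("data", data=X.data)
        f.create_dataset("indices", data=X.indices)
        f.create_dataset("indptr", data=X.indptr)
        f.attrs["shape"] = X.shape

    # Read them back and rebuild the sparse matrix.
    with h5py.File("tfidf.h5", "r") as f:
        X_restored = sparse.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]),
        )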

Also have a look at the internal CSR and COO data structures that scipy.sparse uses to represent sparse matrices, so you can come up with an efficient way to store them in a database.
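As a sketch of that idea, the COO layout maps directly onto a relational table of (row, column, value) triples; sqlite3 stands in below for whatever database you use, and the table and column names are placeholders:

    import sqlite3
    from scipy import sparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    X = TfidfVectorizer().fit_transform(["the first document", "a second document"])
    coo = X.tocoo()

    # Store one (row index, column index, value) triple per nonzero entry.
    conn = sqlite3.connect("tfidf.db")
    conn.execute("CREATE TABLE IF NOT EXISTS tfidf (i INTEGER, j INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO tfidf VALUES (?, ?, ?)",
        zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()),
    )
    conn.commit()

    # Rebuild the matrix from the stored triples (in practice the shape
    # would also need to be stored; here it is reused from X).
    rows, cols, vals = zip(*conn.execute("SELECT i, j, v FROM tfidf"))
    X_restored = sparse.coo_matrix((vals, (rows, cols)), shape=X.shape).tocsr()
    conn.close()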
