
Persist Tf-Idf data

I want to store the TF-IDF matrix so I don't have to recalculate it all the time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters without recomputing the matrix from scratch.
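For reference, a minimal sketch of the setup described above, with a placeholder document list and cluster count:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; in practice these would be the real documents.
documents = ["first example document", "second example document", "another text entirely"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # sparse CSR matrix

# Cluster the TF-IDF vectors; n_clusters=2 is a placeholder value.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(tfidf_matrix)
```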

Pickling (especially using joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
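For example, a minimal sketch of the joblib round trip (the file names are arbitrary placeholders):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(["some example document", "another document"])

# Persist both the fitted vectorizer and the matrix.
joblib.dump(vectorizer, "vectorizer.joblib")
joblib.dump(tfidf_matrix, "tfidf_matrix.joblib")

# Later, ideally under the same scikit-learn version:
vectorizer = joblib.load("vectorizer.joblib")
tfidf_matrix = joblib.load("tfidf_matrix.joblib")
```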

However, the pickle format depends on the class definitions of the models, which can change from one version of scikit-learn to the next.

If you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn, I would recommend writing your own implementation-independent persistence layer.
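One possible way to do this, sketched below under assumed default TfidfVectorizer settings, is to persist only the fitted parameters (the vocabulary and the IDF weights) in plain formats and rebuild the transformation from them; the file names are placeholders:

```python
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

documents = ["some example document", "another example document"]

# Fit once, then export only the learned parameters in plain formats.
vectorizer = TfidfVectorizer()
vectorizer.fit(documents)

with open("vocabulary.json", "w") as f:
    # Cast indices to plain ints in case they are numpy integers.
    json.dump({term: int(idx) for term, idx in vectorizer.vocabulary_.items()}, f)
np.save("idf.npy", vectorizer.idf_)

# Rebuilding later: fix the vocabulary on a CountVectorizer, re-apply the
# saved IDF weights, and l2-normalize (TfidfVectorizer's default norm).
with open("vocabulary.json") as f:
    vocabulary = json.load(f)
idf = np.load("idf.npy")

counts = CountVectorizer(vocabulary=vocabulary).transform(documents)
tfidf = normalize(counts.multiply(idf.reshape(1, -1)), norm="l2")
```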

I would also recommend using the HDF5 file format (used by PyTables, for instance) or another database system that has some kind of support for storing numerical arrays efficiently.
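A minimal sketch of this approach with h5py (PyTables exposes similar functionality; the file and dataset names are placeholders, and a random matrix stands in for real data):

```python
import h5py
import scipy.sparse as sp

tfidf_matrix = sp.random(5, 10, density=0.3, format="csr")  # stand-in for real data

# A CSR matrix is fully described by three arrays plus its shape.
with h5py.File("tfidf.h5", "w") as f:
    f.create_dataset("data", data=tfidf_matrix.data)
    f.create_dataset("indices", data=tfidf_matrix.indices)
    f.create_dataset("indptr", data=tfidf_matrix.indptr)
    f.attrs["shape"] = tfidf_matrix.shape

with h5py.File("tfidf.h5", "r") as f:
    restored = sp.csr_matrix(
        (f["data"][:], f["indices"][:], f["indptr"][:]),
        shape=tuple(f.attrs["shape"]),
    )
```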

Also have a look at scipy.sparse's internal CSR and COO data structures for sparse matrix representation to come up with an efficient way to store those in a database.
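For instance, COO's (row, column, value) triples map directly onto a three-column database table; a minimal sketch with sqlite3, again using a random matrix and placeholder names:

```python
import sqlite3
import scipy.sparse as sp

tfidf_matrix = sp.random(5, 10, density=0.3, format="csr")  # stand-in for real data
coo = tfidf_matrix.tocoo()

conn = sqlite3.connect("tfidf.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tfidf (row_index INTEGER, col_index INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO tfidf VALUES (?, ?, ?)",
    zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()),
)
conn.commit()

# Reload the triples and rebuild the sparse matrix.
rows, cols, values = zip(*conn.execute("SELECT row_index, col_index, value FROM tfidf"))
restored = sp.coo_matrix((values, (rows, cols)), shape=tfidf_matrix.shape).tocsr()
conn.close()
```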
