
Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning on big text corpora. My input data naturally do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is it possible to do this with sklearn? Are there alternatives?

Thanks!

For some algorithms supporting partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk (as folders of flat files, a SQL database server, a NoSQL store, or a Solr index with stored fields, for instance). We also lack an online text vectorizer.

Here is a sample integration template to explain how it would fit together.

import joblib  # needed for the model dumps below
from sklearn.linear_model import Perceptron

from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader

dataset_reader = DataSetReader('/path/to/raw/data')

expected_classes = dataset_reader.get_all_classes()  # need to know the possible classes ahead of time

feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()

for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):

    vectors = feature_extractor.transform(documents)
    classifier.partial_fit(vectors, labels, classes=expected_classes)

    if i % 100 == 0:
        # dump model to be able to monitor quality and later analyse convergence externally
        joblib.dump(classifier, 'model_%04d.pkl' % i)

The dataset reader class is application specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files, which would not require adding a new dependency to the library).
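
For illustration, here is a minimal sketch of what such a reader could look like for one common layout: a folder per class containing flat text files. DataSetReader, get_all_classes and iter_chunks are the hypothetical names used in the template above, not part of scikit-learn.

import os
from itertools import islice

class DataSetReader(object):
    """Hypothetical reader: one sub-folder per class, one text file per document."""

    def __init__(self, path, chunk_size=1000):
        self.path = path
        self.chunk_size = chunk_size

    def get_all_classes(self):
        return sorted(os.listdir(self.path))

    def _iter_documents(self):
        # note: in practice the file list should be shuffled so that each
        # chunk mixes classes, otherwise partial_fit sees one class at a time
        for label in self.get_all_classes():
            folder = os.path.join(self.path, label)
            for filename in os.listdir(folder):
                with open(os.path.join(folder, filename)) as f:
                    yield f.read(), label

    def iter_chunks(self):
        # yield (documents, labels) lists of at most chunk_size items
        documents = self._iter_documents()
        while True:
            chunk = list(islice(documents, self.chunk_size))
            if not chunk:
                break
            texts, labels = zip(*chunk)
            yield list(texts), list(labels)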

The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a Python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.

Alternatively, we could build a HashingTextVectorizer that would use the hashing trick to drop the dictionary requirements. None of those exist at the moment (although we already have some building blocks, such as a murmurhash wrapper and a pull request for hashing features).
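
To make the idea concrete, here is a toy sketch of the hashing trick (using Python's built-in hash for brevity rather than MurmurHash): each token is mapped straight to a fixed column index, so no vocabulary dict ever has to be held in memory.

import scipy.sparse as sp

def hashing_vectorize(documents, n_features=2 ** 20):
    # toy hashing trick: token -> hash(token) % n_features, no vocabulary kept
    rows, cols, values = [], [], []
    for row, doc in enumerate(documents):
        for token in doc.lower().split():
            rows.append(row)
            cols.append(hash(token) % n_features)
            values.append(1)  # a real implementation also uses a hashed sign
    # duplicate (row, col) entries are summed, giving term counts
    return sp.csr_matrix((values, (rows, cols)),
                         shape=(len(documents), n_features))

X = hashing_vectorize(["the quick brown fox", "the lazy dog"])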

In the meantime I would advise you to have a look at Vowpal Wabbit and maybe those Python bindings.

Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.

Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer, which can directly deal with text data.
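
For example, HashingVectorizer is stateless, so transform can be called chunk by chunk without any prior fit, which is exactly what the template above needs in place of the hypothetical SomeTextDocumentVectorizer:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 18)
# no fit needed: the token-to-column mapping is a fixed hash function,
# so chunks streamed from disk can be transformed independently
X = vectorizer.transform(["first chunk of documents", "streamed from disk"])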

Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.

Since sklearn 0.13 there is indeed an implementation of the HashingVectorizer.

EDIT: Here is a full-fledged example of such an application.

Basically, this example demonstrates that you can learn (e.g. classify text) from data that cannot fit in the computer's main memory (but only on disk / network / ...).
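
A minimal sketch of that pattern, assuming a hypothetical generator iter_minibatches that yields (texts, labels) chunks from disk (the linked example uses a similar helper over the Reuters files):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)
classifier = SGDClassifier()

all_classes = ['spam', 'ham']  # must be known ahead of time for partial_fit

for texts, labels in iter_minibatches('/path/to/raw/data'):  # hypothetical reader
    X = vectorizer.transform(texts)  # stateless: no fit, no vocabulary in memory
    classifier.partial_fit(X, labels, classes=all_classes)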

Besides Vowpal Wabbit, gensim might be interesting as well - it, too, features online Latent Dirichlet Allocation.
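
A minimal sketch of gensim's online LDA (the tiny in-memory corpus here is just for illustration; the dictionary and update batches are assumptions about how you would feed your own stream):

from gensim import corpora, models

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)  # for true streaming, grow it with add_documents
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
# online update: fold in further mini-batches of documents as they arrive
lda.update([dictionary.doc2bow(["human", "trees"])])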
