
Exporting a scikit-learn Random Forest for use on a Hadoop platform

Originally posted 2014-06-13 19:28:11. Tags: python, hadoop, machine-learning, scikit-learn, pmml

I've developed a spam classifier using pandas and scikit-learn, and it's now ready for integration into our Hadoop-based system. To that end, I need to export the classifier to a more portable format than a Python pickle.

The Predictive Model Markup Language (PMML) is my preferred export format. It plays exceedingly well with Cascading, which we already use. Surprisingly, however, I can't find any Python libraries that export scikit-learn models to PMML.

Has anyone had experience with this use case? Is there an alternative to PMML that would provide interoperability between scikit-learn and Hadoop? Or is there a solid PMML export library?

1 Answer

You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML from Zementis seems to be a commercial product. Besides this route, I'm not aware of any other tools for scoring scikit-learn models exported as PMML on Java/Hadoop. The core scikit-learn team is planning to implement a PMML exporter, though. If you don't want a commercial solution and don't want to wait for such a tool to be implemented, you still have some options, but they require some coding:

Adapt the SKLearn Compiled trees project so it generates Java/MapReduce code instead of C.
Use the export_graphviz function to obtain the DOT representation of each decision tree, and write a small Java interpreter for it.
Forget about Java and Hadoop: use Apache Spark and evaluate the decision trees in parallel using Python, scikit-learn and PySpark.
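
For the export_graphviz option, a minimal sketch of the Python side might look like the following. The toy dataset, the feature names f0..f3 and the helper name forest_to_dot are illustrative assumptions, not part of the original classifier; the Java interpreter that would consume the DOT text is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz


def forest_to_dot(forest, feature_names=None):
    """Return one DOT string per decision tree in a fitted forest.

    forest_to_dot is a hypothetical helper name used only for this sketch.
    """
    return [
        # out_file=None makes export_graphviz return the DOT source as a string
        export_graphviz(tree, out_file=None, feature_names=feature_names)
        for tree in forest.estimators_
    ]


# Toy stand-in for the real spam features (assumption: 4 numeric features).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=3, max_depth=3, random_state=0)
forest.fit(X, y)

# One DOT document per tree; each could be written to its own file and
# shipped to the cluster for a Java-side interpreter to parse.
dot_sources = forest_to_dot(forest, feature_names=["f0", "f1", "f2", "f3"])
print(dot_sources[0][:200])
```

Each string in dot_sources describes one tree's split nodes (feature and threshold) and leaves, so the Java side would need to walk every tree and combine the per-tree outputs itself to reproduce the forest's prediction.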