Exporting a Scikit Learn Random Forest for use on Hadoop Platform

I've developed a spam classifier using pandas and scikit-learn, and it is now ready for integration into our Hadoop-based system. To this end, I need to export my classifier to a more common format than a pickle.

The Predictive Model Markup Language (PMML) is my preferred export format. It plays exceedingly well with Cascading, which we already use. However, to my surprise, I cannot find any Python libraries that export scikit-learn models to PMML.

Has anyone had experience with this use case? Is there any alternative to PMML that would provide interoperability between scikit-learn and Hadoop? What about a solid PMML export library?

You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML from Zementis seems to be a commercial product. As far as I know, there are no other tools for scoring scikit-learn models exported as PMML on Java/Hadoop. The core scikit-learn team is planning to implement a PMML exporter, though. If you don't want a commercial solution and don't want to wait for such a tool to be implemented, you still have some options, but they require some coding:

  • Adapt the SKLearn Compiled trees project so that it generates Java/MapReduce code instead of C.
  • Use the export_graphviz function to obtain the DOT representation of each decision tree, and write a small Java interpreter for it.
  • Forget about Java and Hadoop: use Apache Spark and evaluate each of the decision trees in parallel using Python, scikit-learn and PySpark.
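As a rough illustration of the second option, the sketch below exports every tree of a fitted forest to a DOT string with export_graphviz. The synthetic data, the feature names `f0`..`f7`, and the helper names `forest_to_dot` and `average_tree_probabilities` are assumptions made for the example, not part of the original classifier. The second helper shows the combination step an external interpreter would also have to reproduce: scikit-learn's random forest averages the class probabilities of its trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the spam features (assumption: the real data is not shown).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X, y)


def forest_to_dot(forest, feature_names=None):
    """Return one DOT string per decision tree in the fitted forest."""
    return [
        # out_file=None makes export_graphviz return the DOT source as a string.
        export_graphviz(tree, out_file=None, feature_names=feature_names)
        for tree in forest.estimators_
    ]


def average_tree_probabilities(forest, X):
    """Average per-tree class probabilities, as an external interpreter would."""
    return np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)


# Hypothetical feature names, used only to label the split nodes in the DOT output.
dot_trees = forest_to_dot(forest, feature_names=[f"f{i}" for i in range(X.shape[1])])
manual_proba = average_tree_probabilities(forest, X)
```

Each string in `dot_trees` could be written to its own file and shipped to the cluster. The Java side would then parse the split conditions and leaf values of every tree and average the per-tree results, which is what `average_tree_probabilities` does here in Python.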

Hope it helps!
