Exporting a Scikit Learn Random Forest for use on Hadoop Platform

Original post 2014-06-13 19:28:11 5 1 python/ hadoop/ machine-learning/ scikit-learn/ pmml

I've developed a spam classifier using pandas and scikit-learn, and it is now ready for integration into our Hadoop-based system. To do that, I need to export the classifier to a more portable format than a pickle.

The Predictive Model Markup Language (PMML) is my preferred export format. It works very well with Cascading, which we already use. Surprisingly, though, I cannot find any Python library that exports scikit-learn models to PMML.

Has anyone had experience with this use case? Is there an alternative to PMML that would provide interoperability between scikit-learn and Hadoop? Or is there a solid PMML export library?

1 answer

You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML from Zementis seems to be a commercial product. Apart from this route, there are no other tools for scoring scikit-learn models exported as PMML on Java/Hadoop. The core scikit-learn team is planning to implement a PMML exporter, though. If you don't want a commercial solution and don't want to wait for such a tool to be implemented, you still have some options, but they require some coding:

Adapt the SKLearn Compiled trees project so that it generates Java/MapReduce code instead of C.
Use the export_graphviz function to obtain the DOT representation of each decision tree, and write a small Java interpreter for it.
Forget about Java and Hadoop: use Apache Spark and evaluate each of the decision trees in parallel using Python, scikit-learn and PySpark.
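
To make the last two options a bit more concrete, here is a minimal sketch in Python. It assumes a recent scikit-learn in which export_graphviz returns the DOT text when out_file=None, and it trains a throwaway forest on synthetic data in place of the real spam classifier (the data, the variable names and the forest size are all made up for illustration). It first dumps one DOT string per tree, which is what a Java interpreter would have to parse, and then checks that averaging the per-tree probabilities reproduces the forest's own output, which is the property that lets you score the trees independently (for example in parallel) and combine the results afterwards.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the real spam features (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small forest so the exported text stays readable.
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X, y)

# Option 2: one DOT document per fitted tree. With out_file=None the
# function returns the DOT source as a string instead of writing a file.
dot_sources = [
    export_graphviz(tree, out_file=None) for tree in forest.estimators_
]

# Option 3 relies on this: the forest's class probabilities are the mean
# of the individual trees' probabilities, so each tree can be scored
# separately and the results averaged afterwards.
per_tree_proba = [tree.predict_proba(X) for tree in forest.estimators_]
averaged_proba = np.mean(per_tree_proba, axis=0)
matches_forest = np.allclose(averaged_proba, forest.predict_proba(X))

print(dot_sources[0][:200])
print("per-tree average matches forest:", matches_forest)
```

Each entry of dot_sources could be written to its own file and shipped to the cluster. With PySpark, the per-tree scoring step would be distributed across workers rather than run in a local list comprehension; that part is not shown here.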

Hope it helps!

Answer 1: score 9, accepted, posted 2014-06-13 22:54:05