
Fit in distributed, predict in standalone

How can one train (fit) a model on a distributed big-data platform (e.g. Apache Spark), yet use that model on a standalone machine (e.g. a plain JVM) with as few dependencies as possible?

I have heard of PMML, but I am not sure whether it is enough. Spark 2.0 also supports persistent model saving, but I am not sure what is needed to load and run those saved models.

Apache Spark persistence is about saving and loading Spark ML pipelines in a JSON-based data format (think of it as Python's pickle mechanism, or R's RDS mechanism). These JSON data structures map directly to Spark ML classes, so they don't make sense on other platforms.
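A minimal sketch of what that persistence looks like in practice, assuming Spark 2.x; `pipeline`, `trainingData`, and the save path are placeholders, and note that the loading side still needs a full Spark runtime:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// Fitting side (on the cluster): persist the fitted pipeline.
// "hdfs:///models/my-pipeline" is a hypothetical path.
val pipelineModel: PipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("hdfs:///models/my-pipeline")

// Loading side: this requires Spark itself on the classpath,
// because the saved artifacts map back to Spark ML classes.
val reloaded: PipelineModel = PipelineModel.load("hdfs:///models/my-pipeline")
```

This is exactly why Spark-native persistence does not solve the standalone-scoring problem: `PipelineModel.load` drags in the whole Spark dependency tree.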

As for PMML, you can convert Spark ML pipelines to PMML documents using the JPMML-SparkML library. You can then execute PMML documents (regardless of whether they came from Apache Spark, Python, or R) using the JPMML-Evaluator library. If you use Apache Maven to manage and build your project, JPMML-Evaluator can be pulled in by adding a single dependency declaration to your project's POM.
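For illustration, that single Maven dependency would look roughly like the fragment below; the version number is a placeholder, so check the JPMML-Evaluator project's README for the current release and artifact layout:

```xml
<!-- Version is hypothetical — see the JPMML-Evaluator README for the latest release. -->
<dependency>
  <groupId>org.jpmml</groupId>
  <artifactId>pmml-evaluator</artifactId>
  <version>1.4.15</version>
</dependency>
```

This is the whole appeal of the PMML route: the scoring side depends only on the evaluator library, not on Spark.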
