
How to fit and score machine learning models in a Java/JVM based application

Could you please guide me on how to create and execute machine learning/statistical models (regression, decision tree, k-means clustering, Naive Bayes, scorecard/linear/logistic regression, GBM, GLM, etc.) in a Java/JVM based application (in production)?

We have an ETL-style Java based product where one can do most of the data preparation steps for machine learning, such as data ingestion from JDBC, files, HDFS, NoSQL, etc., plus joins and aggregations (which are required for feature engineering), and now we want to add analytics capabilities using machine learning/statistical modeling.

Right now, we are using JPMML-Evaluator to score models created in PMML format with R and Python (and KNIME), but this needs three separate and unconnected steps:

1. Do the data preparation in our Java/JVM application and save the sampled (training and test) data to a CSV file or a database.
2. Create a machine learning model in R or Python (or KNIME) and export it in PMML 4.2 format.
3. Import/deploy the PMML in our Java based application and use JPMML-Evaluator to execute it in production.
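For reference, our current scoring step (#3) looks roughly like the following JPMML-Evaluator sketch (file and field names are placeholders, and the exact API differs between jpmml-evaluator versions; newer releases use plain String keys instead of FieldName):

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

public class PmmlScoringExample {
    public static void main(String[] args) throws Exception {
        // "model.pmml" is a placeholder for the file exported from R/Python/KNIME
        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                .load(new File("model.pmml"))
                .build();
        evaluator.verify(); // self-check against embedded verification data, if any

        // One prepared record from our ETL step; column names are placeholders
        Map<String, Object> inputRecord = Map.of("age", 42, "income", 58000.0);

        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            FieldName name = inputField.getName();
            Object rawValue = inputRecord.get(name.getValue());
            arguments.put(name, inputField.prepare(rawValue));
        }

        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        System.out.println(EvaluatorUtil.decodeAll(results));
    }
}
```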

I am sure this is a common problem in machine learning, as Java is generally preferred over Python or R in production. Could you suggest better approaches to create as well as execute a Python/scikit-learn based machine learning model in a JVM based application?

What are your thoughts on how to achieve steps #2 and #3 more seamlessly in a JVM based application, without compromising performance and usability? The options I see are:

1. Call a Java program which internally calls a Python scikit-learn script (under the hood) to create a model in PMML, and then use JPMML-Evaluator to score it. To the user it will look like a single JVM based application (better usability). I am not sure what the limitations and shortcomings of using PMML are, since not all features are supported by jpmml-sklearn.
2. Call a Java program which internally calls a Python script that does the model creation as well as the execution in an external Python environment, and serializes the model and the results to a file/CSV or an in-memory DB (or cache, like Hazelcast), from which the parent Java application fetches the results (a rough sketch of this approach is shown below). From what I researched, I can't use Jython to execute scikit-learn models.
3. Can I use Jep (Embed Python in Java) to embed CPython in the JVM? Has anybody tried it with scikit-learn models?
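For option 2, I am imagining something along these lines (an untested sketch; the script name, file paths and python executable are placeholders):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class PythonModelRunner {
    public static void main(String[] args) throws Exception {
        // train_and_score.py and the CSV paths are placeholders for our own artifacts
        ProcessBuilder pb = new ProcessBuilder(
                "python", "train_and_score.py",
                "--train", "prepared_train.csv",
                "--score", "prepared_test.csv",
                "--out", "predictions.csv");
        pb.redirectErrorStream(true); // merge stderr into stdout for simpler logging

        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("[python] " + line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("Python model run failed with exit code " + exitCode);
        }
        // The parent Java application would now read predictions.csv (or an in-memory cache)
    }
}
```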

Alternatively, I should explore using Mahout or Weka, Java based machine learning libraries, in my JVM based application. (I need to support both Windows and non-Windows platforms.)
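As I understand it, the Weka route would keep everything in the JVM; an untested sketch of what training and scoring a decision tree might look like (the ARFF file name is a placeholder):

```java
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaDecisionTreeExample {
    public static void main(String[] args) throws Exception {
        // "training.arff" is a hypothetical file produced by our data-preparation step
        Instances data = new DataSource("training.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the label

        J48 tree = new J48(); // C4.5 decision tree
        tree.buildClassifier(data);

        // Score the first row as a smoke test
        Instance first = data.instance(0);
        double predicted = tree.classifyInstance(first);
        System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
    }
}
```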

I am also exploring H2O.ai, which is Java based. Has anybody tried it?

I use IntelliJ IDEA with the Python plugin. This way I have both Java and Python code in one and the same project. The data is in the database; the connection is always visible and accessible, regardless of whether I currently have a .java or a .py file open in the editor. In the list of run configurations you can have Python scripts, Java applications, Maven goals, etc. Therefore I don't think you have to mix Python and Java code together (by calling Python scripts from Java). That is completely unnecessary.

My workflow (everything in IntelliJ IDEA):

1. Prepare the data (usually SQL).
2. Run a Python script, which applies a pipeline of transformers to the pandas data frame constructed from a database table and outputs a PMML file.
3. Use the scikit-learn model (via the PMML) in your Java application.

If you have an ETL with an HDFS backend, I would suggest deploying Spark on the cluster and using Spark's MLlib machine learning algorithms. It supports the methods you mentioned above.
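To make that concrete, here is a minimal sketch of fitting and scoring a logistic regression with Spark's Java API (paths, parameters and column selections are placeholders):

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMllibExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("EtlMlExample")
                .getOrCreate();

        // Hypothetical HDFS path; the prepared data is assumed to be in LibSVM format
        Dataset<Row> training = spark.read().format("libsvm")
                .load("hdfs:///data/prepared/training.libsvm");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.01);
        LogisticRegressionModel model = lr.fit(training);

        // Score the training set and persist the fitted model for later use
        model.transform(training)
                .select("label", "prediction", "probability")
                .show(5);
        model.write().overwrite().save("hdfs:///models/lr-demo");

        spark.stop();
    }
}
```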

Do you mind giving some context as to the size (rows, columns, types) of the data that you plan to work with? Java would not be my recommended go-to language for ML, but Scala compiles to JVM bytecode and has a similar syntax to Java (in addition to being able to use Java APIs).

If you're producing a proof of concept, then Java is fine, but if you're planning on working with big data, it doesn't really scale well.

I have found a decent solution to my problem. I am using H2O.ai, an open-source, Java based platform for scalable machine learning. It offers APIs in Java (RESTful API), Python, R and Scala. It has best-in-class algorithms for classification, regression, clustering, etc., and also integrates seamlessly with Apache Hadoop and Spark (Sparkling Water) if you have a Spark cluster. It also offers a deep learning algorithm based on a multi-layer feedforward artificial neural network. I am using the Java binding/REST API and sometimes the low-level H2O API (for managing a 3-node H2O cluster).
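For anyone interested, besides the REST API, H2O models can also be exported as MOJOs and scored directly inside a JVM process with the h2o-genmodel library; a rough illustrative sketch (the MOJO file name and column names are placeholders):

```java
import java.util.Arrays;

import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

public class H2oMojoScorer {
    public static void main(String[] args) throws Exception {
        // "gbm_model.zip" is a hypothetical MOJO exported from a trained H2O model
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(
                MojoModel.load("gbm_model.zip"));

        // Hypothetical feature values; names must match the training frame's columns
        RowData row = new RowData();
        row.put("age", "42");
        row.put("income", "58000");

        BinomialModelPrediction p = model.predictBinomial(row);
        System.out.println("Predicted label: " + p.label);
        System.out.println("Class probabilities: " + Arrays.toString(p.classProbabilities));
    }
}
```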

I came across another Java based alternative called Smile (Statistical Machine Intelligence and Learning Engine), which provides regression, classification, clustering, association rule mining, feature selection, etc. Does anybody have more feedback on these or similar Java based ML libraries?


 