
Predict with H2O MOJO Model using hex.genmodel API

I'm trying to figure out how to load a saved H2O MOJO model and use it on a Spark DataFrame without needing Sparkling Water. My approach is to load the h2o-genmodel.jar file when Spark starts up and then use PySpark's Py4J interface to access it. My concrete question is how to access the values generated by the py4j.java_gateway objects.

Below is a minimal example:

Train model

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
import pandas as pd
import numpy as np

h2o.init()

features = pd.DataFrame(np.random.randn(6,3),columns=list('ABC'))
target = pd.DataFrame(pd.Series(["cat","dog","cat","dog","cat","dog"]), columns=["target"])
df = pd.concat([features, target], axis=1)
df_h2o = h2o.H2OFrame(df)

rf = H2ORandomForestEstimator()
rf.train(["A","B","C"],"target",training_frame=df_h2o, validation_frame=df_h2o)

Save MOJO

model_path = rf.download_mojo(path="./mojo/", get_genmodel_jar=True)
print(model_path)

Load MOJO

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.jars", "/home/ec2-user/Notebooks/mojo/h2o-genmodel.jar").getOrCreate()

MojoModel = spark._jvm.hex.genmodel.MojoModel
EasyPredictModelWrapper = spark._jvm.hex.genmodel.easy.EasyPredictModelWrapper
RowData = spark._jvm.hex.genmodel.easy.RowData

mojo = MojoModel.load(model_path)
easy_model = EasyPredictModelWrapper(mojo)

Predict on a single row of data

r = RowData()
r.put("A", -0.631123)
r.put("B", 0.711463)
r.put("C", -1.332257)

score = easy_model.predictBinomial(r).classProbabilities
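With more than a handful of columns, the row can be filled from a Python dict instead of one put call per feature. The following is a minimal sketch, not part of the original example: fill_row and StubRow are hypothetical names, and StubRow only stands in for RowData so the snippet runs without a JVM. It also assumes that the wrapper wants numeric columns as Java Double values, so plain Python ints are converted to float before they cross the Py4J boundary.

```python
def fill_row(row, values):
    """Copy name/value pairs into a RowData-like object via its put() method."""
    for name, value in values.items():
        # Strings are kept as-is (categorical columns); every other value is
        # converted to float, which Py4J passes to Java as a Double.
        row.put(name, value if isinstance(value, str) else float(value))
    return row


class StubRow(dict):
    """Hypothetical stand-in for RowData so the sketch runs without a JVM."""

    def put(self, key, value):
        self[key] = value


# The integer -1 is an illustrative value chosen to show the float conversion.
stub = fill_row(StubRow(), {"A": -0.631123, "B": 0.711463, "C": -1})
print(stub)
```

With a live gateway, the same helper would be called as fill_row(RowData(), {...}) and the result passed to easy_model.predictBinomial.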

That is as far as I have been able to get. Where I'm having trouble is inspecting what score gives back to me. print(score) yields the following: <py4j.java_gateway.JavaMember at 0x7fb2e09b4e80>. Presumably there is a way to get the actual generated values from this object, but how would I do that?

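The returned object is a BinomialModelPrediction (in hex.genmodel.easy.prediction). classProbabilities is a public field on it that holds a Java array (double[]); it is not a method. Py4J only resolves fields through plain attribute access when the gateway is created with auto_field enabled, which does not appear to be the case for the gateway PySpark sets up, so .classProbabilities comes back as a JavaMember handle rather than the array. That is why your print statement returns something non-human-readable.

One way to access this value is py4j's get_field helper.

For example, this should work:

from py4j.java_gateway import get_field

for p in get_field(easy_model.predictBinomial(r), "classProbabilities"):
    print(p)

Or you can convert it to a list by wrapping the get_field(...) call in list().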
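Pairing each probability with its class label makes the result easier to read. Here is a minimal sketch, assuming the labels and probabilities have already been pulled out of the JVM. The helper name to_probability_dict is hypothetical, and the literal values below are stand-ins rather than real model output. With a live gateway, the probabilities would come from py4j's get_field on the prediction object, and the labels could come from the wrapper's getResponseDomainValues() method, if your h2o-genmodel version provides it.

```python
def to_probability_dict(labels, probabilities):
    """Pair class labels with probabilities, copied into plain Python types."""
    # Copying detaches the result from Py4J proxies (JavaArray behaves like a
    # sequence), so it stays usable after the JVM objects are gone.
    labels = [str(label) for label in labels]
    probabilities = [float(p) for p in probabilities]
    if len(labels) != len(probabilities):
        raise ValueError("expected one probability per class label")
    return dict(zip(labels, probabilities))


# Stand-in inputs; real ones would be Py4J JavaArray objects.
example = to_probability_dict(["cat", "dog"], [0.62, 0.38])
print(example)
```

The copy into a plain dict is a design choice: it keeps later code (logging, building a pandas row) free of any dependency on the Py4J gateway.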
