
Create a map to call the POJO for each row of a Spark DataFrame

I built an H2O model in R and saved the POJO code. I want to score Parquet files in HDFS using the POJO, but I'm not sure how to go about it. I plan on reading the Parquet files into Spark (Scala/SparkR/PySpark) and scoring them there. Below is an excerpt I found on H2O's documentation page.

"How do I run a POJO on a Spark Cluster? “如何在Spark集群上运行POJO?

The POJO provides just the math logic to do predictions, so you won't find any Spark (or even H2O) specific code there. If you want to use the POJO to make predictions on a dataset in Spark, create a map to call the POJO for each row and save the result to a new column, row-by-row."

Does anyone have some example code showing how I can do this? I'd greatly appreciate any assistance. I code primarily in R and SparkR, and I'm not sure how I can "map" the POJO onto each row.

Thanks in advance.

I just posted a solution that actually uses DataFrame/Dataset. The post used a Star Wars dataset to build a model in R and then scored the corresponding MOJO on the test set in Spark. I'll paste the only relevant part here:

Scoring with Spark (and Scala)

You could use either spark-submit or spark-shell. If you use spark-submit, h2o-genmodel.jar needs to be put under the lib folder of the root directory of your Spark application so it can be added as a dependency during compilation. The following code assumes you're running spark-shell. In order to use h2o-genmodel.jar, you need to append the jar file when launching spark-shell by providing the --jars flag. For example:

/usr/lib/spark/bin/spark-shell \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.driver.memory="3g" \
--conf spark.executor.memory="10g" \
--conf spark.executor.instances=10 \
--conf spark.executor.cores=4 \
--jars /path/to/h2o-genmodel.jar
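
If you go the spark-submit route instead, the genmodel jar can likewise be attached at submit time. A minimal sketch, in which the main class and application jar path are placeholders:

/usr/lib/spark/bin/spark-submit \
--class com.example.ScoreWithMojo \
--jars /path/to/h2o-genmodel.jar \
/path/to/your-scoring-app.jar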

Now in the Spark shell, import the dependencies:

import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.MojoModel

Using DataFrame

val modelPath = "/path/to/zip/file"
val dataPath = "/path/to/test/data"

// Import data
val dfStarWars = spark.read.option("header", "true").csv(dataPath)
// Import MOJO model
val mojo = MojoModel.load(modelPath)
val easyModel = new EasyPredictModelWrapper(mojo)

// score
val dfScore = dfStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.getAs[String](1))
    r.put("mass", x.getAs[String](2))
    val score = easyModel.predictBinomial(r).classProbabilities
    (x.getAs[String](0), score(1))
}.toDF("name", "isHumanScore")

The variable score is an array of two class probabilities, for levels 0 and 1; score(1) is the probability for level 1, which is "human". By default the map call returns columns with unspecified names "_1", "_2", etc., which you can rename by calling toDF.
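
The same flow covers the Parquet-on-HDFS case from the question: spark.read.parquet replaces the CSV reader, and the scored DataFrame can be written back out. A sketch with placeholder paths:

val dfParquet = spark.read.parquet("hdfs:///path/to/input")
// ... build dfScore exactly as above, starting from dfParquet ...
dfScore.write.mode("overwrite").parquet("hdfs:///path/to/scores")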

Using Dataset

To use the Dataset API we just need to create two case classes, one for the input data and one for the output.

case class StarWars (
  name: String,
  height: String,
  mass: String,
  is_human: String
)

case class Score (
  name: String,
  isHumanScore: Double
)


// Dataset
val dtStarWars = dfStarWars.as[StarWars]
val dtScore = dtStarWars.map {
  x =>
    val r = new RowData
    r.put("height", x.height)
    r.put("mass", x.mass)
    val score = easyModel.predictBinomial(r).classProbabilities
    Score(x.name, score(1))
}

With a Dataset you can get the value of a column by calling x.columnName directly. Just note that the column values here have to be Strings, so if the fields in your case class are of other types you will need to convert them manually when filling the RowData.
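
For example, with a hypothetical variant of the case class above that uses numeric fields, the conversion might look like this:

case class StarWarsTyped (
  name: String,
  height: Double,
  mass: Double,
  is_human: String
)

val dtTyped = dfStarWars
  .withColumn("height", $"height".cast("double"))
  .withColumn("mass", $"mass".cast("double"))
  .as[StarWarsTyped]

val dtScoreTyped = dtTyped.map { x =>
  val r = new RowData
  // RowData entries are filled as Strings, so convert the numeric fields back
  r.put("height", x.height.toString)
  r.put("mass", x.mass.toString)
  Score(x.name, easyModel.predictBinomial(r).classProbabilities(1))
}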

If you want to perform scoring with a POJO or MOJO in Spark, you should use RowData, which is provided in h2o-genmodel.jar, as the row-by-row input data when calling the easyPredict method to generate scores.

Your solution will be to read the Parquet file from HDFS and then, for each row, convert it to a RowData object by filling in each entry, then pass it to your POJO scoring function. Remember that POJO and MOJO both use exactly the same scoring function; the only difference is in how the POJO class is used versus the MOJO resources zip package. Since MOJOs are backward compatible and work with any newer h2o-genmodel.jar, it is best to use a MOJO instead of a POJO.

Following is the full Scala code you can use on Spark to load a MOJO model and then do the scoring:

import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel

// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/gbm_model.zip")
val easyModel = new EasyPredictModelWrapper(mojo)

// Get Mojo Details
var features = mojo.getNames.toBuffer

// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")

// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities 
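
The returned binomial prediction also carries the predicted label alongside the class probabilities, which is convenient for a quick sanity check:

val pred = easyModel.predictBinomial(r)
println(s"Predicted label: ${pred.label}")
println(s"Class probabilities: ${pred.classProbabilities.mkString(", ")}")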

Here is an example of reading Parquet files in Spark and then saving them as CSV. You can use the same code to read Parquet from HDFS and then pass each row as a RowData to the example above.
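
A minimal sketch of that flow, reusing easyModel from above; the HDFS paths are placeholders and the feature columns are assumed to be the AGE/RACE/DCAPS/VOL/GLEASON inputs from the example:

val dfParquet = spark.read.parquet("hdfs:///path/to/parquet/dir")

val dfScored = dfParquet.map { row =>
  val r = new RowData
  // fill one RowData entry per feature column, converting each value to a String
  Seq("AGE", "RACE", "DCAPS", "VOL", "GLEASON").foreach { col =>
    val v = row.getAs[Any](col)
    if (v != null) r.put(col, v.toString)
  }
  easyModel.predictBinomial(r).classProbabilities(1)
}.toDF("score")

dfScored.write.option("header", "true").csv("hdfs:///path/to/output")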

Here is a detailed example of using a MOJO model in Spark and performing scoring using RowData.
