
Create a new Spark Dataset&lt;Row&gt; based on an existing Dataset&lt;Row&gt; and an added HashMap

I have a Dataset&lt;Row&gt; based on JSON data. Now I would like to create a new Dataset&lt;Row&gt; based on the initial dataset, but with an added column whose datatype is a Java HashMap&lt;String, String&gt;, something like:

Dataset<Row> dataset2 = dataset1.withColumn("newColumn", *some way to specify HashMap<String, String> as the added column's datatype*);

Using this new dataset I could create a row encoder such as

ExpressionEncoder<Row> dataset2Encoder = RowEncoder.apply(dataset2.schema());

and then apply a map function such as

dataset2 = dataset2.map(new XyzFunction(), dataset2Encoder);

CLARIFICATION: My initial dataset is based on data in JSON format. What I'm trying to accomplish is to create a new dataset based on this initial dataset, but with a new column added in the MapFunction. The thought of adding the column (withColumn) when creating the initial dataset was to make sure that a schema definition would exist for the column I'd like to update in the MapFunction. However, I can't seem to find a way of modifying the Row object passed to the call(Row arg) function of the MapFunction class, OR of creating a new instance using RowFactory.create(...) in the call function. I'd like to be able to create a Row instance in the MapFunction based on all the existing values of the passed Row object AND a new Map to be added to the new row. The encoder would then know about this new/generated column from the generated schema. I hope this clarifies what I'm trying to accomplish...

You can

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;

df.withColumn("newColumn", lit(null).cast("map<string, string>"));

or

df.withColumn(
  "newColumn", 
  lit(null).cast(
    DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
  )
);
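Both forms produce the same column type; printing the schema of the result (a quick check, not from the original answer) would show something like:

df.withColumn("newColumn", lit(null).cast("map<string, string>")).printSchema();
// root
//  |-- ... the original JSON-derived columns ...
//  |-- newColumn: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)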

But why such indirection? If all you really need is an Encoder that knows about the extra column, you can extend the schema directly:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

Encoder<Row> enc = RowEncoder.apply(df.schema().add(
  "newColumn",
  DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
));
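With that encoder in hand, a map function can rebuild each row from all of its existing values plus the new map, which is what the clarification asks for. A minimal sketch, assuming the new column is the last field of the extended schema; the class name XyzFunction, the source column "id", and the map contents are illustrative, not from the original post:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import scala.collection.JavaConverters;

public class XyzFunction implements MapFunction<Row, Row> {
  @Override
  public Row call(Row row) {
    // Copy every existing value from the incoming row
    Object[] values = new Object[row.size() + 1];
    for (int i = 0; i < row.size(); i++) {
      values[i] = row.get(i);
    }
    // Build the map value for the appended column (contents are just an example)
    Map<String, String> extra = new HashMap<>();
    extra.put("sourceId", row.<String>getAs("id"));
    // Convert to a Scala map, which matches what RowEncoder expects for a MapType field
    values[row.size()] = JavaConverters.mapAsScalaMapConverter(extra).asScala();
    return RowFactory.create(values);
  }
}

// usage:
// Dataset<Row> result = df.map(new XyzFunction(), enc);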

Depending on what exactly you are trying to do, using a UserDefinedFunction might be much simpler, and would allow you to skip Encoders completely. For example:
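A sketch of the UDF route, assuming a SparkSession named spark and an existing string column "id"; the UDF name and the map contents are again just placeholders:

import static org.apache.spark.sql.functions.*;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that derives a Map<String, String> from an existing column
spark.udf().register(
  "buildMap",
  (UDF1<String, Map<String, String>>) id -> {
    Map<String, String> m = new HashMap<>();
    m.put("sourceId", id);
    return m;
  },
  DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
);

// Add the column without any Row manipulation or custom Encoder
Dataset<Row> withMap = df.withColumn("newColumn", callUDF("buildMap", col("id")));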
