
Create a new Spark Dataset&lt;Row&gt; based on an existing Dataset&lt;Row&gt; and an added HashMap

I have a Dataset&lt;Row&gt; based on JSON data. Now I would like to create a new Dataset&lt;Row&gt; based on the initial dataset, but with an added column whose datatype is a Java HashMap&lt;String, String&gt;, something like:

Dataset<Row> dataset2 = dataset1.withColumn("newColumn", *some way to specify HashMap<String, String> as the added column's datatype*);

Using this new dataset I could create a row encoder such as

ExpressionEncoder<Row> dataset2Encoder = RowEncoder.apply(dataset2.schema());

and then apply a map function such as

dataset2 = dataset2.map(new XyzFunction(), dataset2Encoder);

CLARIFICATION: My initial dataset is based on data in JSON format. What I'm trying to accomplish is to create a new dataset based on this initial dataset, but with a new column added in the MapFunction. The thought of adding the column (withColumn) when creating the initial dataset was to make sure that a schema definition would exist for the column I'd like to update in the MapFunction. However, I can't seem to find a way of modifying the Row object passed to the call(Row arg) function of the MapFunction class, OR of creating a new instance using RowFactory.create(...) in the call function. I'd like to be able to create a Row instance in the MapFunction based on all the existing values of the passed Row object AND a new Map to be added to the new row. The encoder would then know about this new/generated column from the generated schema. I hope this clarifies what I'm trying to accomplish...

You can

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;

df.withColumn("newColumn", lit(null).cast("map<string, string>"));

or

df.withColumn(
  "newColumn", 
  lit(null).cast(
    DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
  )
);
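Both forms produce the same column type; printing the schema of the result (a quick check, not from the original answer) would show something like:

df.withColumn("newColumn", lit(null).cast("map<string, string>")).printSchema();
// root
//  |-- ... the original JSON-derived columns ...
//  |-- newColumn: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)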

But why such indirection? If all you really need is an Encoder that knows about the extra column, you can extend the schema directly:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

Encoder<Row> enc = RowEncoder.apply(df.schema().add(
  "newColumn",
  DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
));
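With that encoder in hand, a map function can rebuild each row from all of its existing values plus the new map, which is what the clarification asks for. A minimal sketch, assuming the new column is the last field of the extended schema; the class name XyzFunction, the source column "id", and the map contents are illustrative, not from the original post:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import scala.collection.JavaConverters;

public class XyzFunction implements MapFunction<Row, Row> {
  @Override
  public Row call(Row row) {
    // Copy every existing value from the incoming row
    Object[] values = new Object[row.size() + 1];
    for (int i = 0; i < row.size(); i++) {
      values[i] = row.get(i);
    }
    // Build the map value for the appended column (contents are just an example)
    Map<String, String> extra = new HashMap<>();
    extra.put("sourceId", row.<String>getAs("id"));
    // Convert to a Scala map, which matches what RowEncoder expects for a MapType field
    values[row.size()] = JavaConverters.mapAsScalaMapConverter(extra).asScala();
    return RowFactory.create(values);
  }
}

// usage:
// Dataset<Row> result = df.map(new XyzFunction(), enc);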

Depending on what exactly you are trying to do, using a UserDefinedFunction might be much simpler, and would allow you to skip Encoders completely. For example:
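A sketch of the UDF route, assuming a SparkSession named spark and an existing string column "id"; the UDF name and the map contents are again just placeholders:

import static org.apache.spark.sql.functions.*;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that derives a Map<String, String> from an existing column
spark.udf().register(
  "buildMap",
  (UDF1<String, Map<String, String>>) id -> {
    Map<String, String> m = new HashMap<>();
    m.put("sourceId", id);
    return m;
  },
  DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
);

// Add the column without any Row manipulation or custom Encoder
Dataset<Row> withMap = df.withColumn("newColumn", callUDF("buildMap", col("id")));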
