Create a new Spark Dataset<Row> based on an existing Dataset<Row> and an added HashMap
I have a Dataset<Row> based on JSON data. Now I would like to create a new Dataset<Row> based on the initial dataset, but with an added column whose datatype is a Java HashMap<String, String>, something like:
Dataset<Row> dataset2 = dataset1.withColumn("newColumn", *some way to specify HashMap<String, String> as the added column's datatype*);
Using this new dataset I could create a row-encoder such as
ExpressionEncoder<Row> dataset2Encoder = RowEncoder.apply(dataset2.schema());
and then apply a map-function such as
dataset2 = dataset2.map(new XyzFunction(), dataset2Encoder)
CLARIFICATION: My initial dataset is based on data in JSON format. What I'm trying to accomplish is to create a new dataset based on this initial dataset, but with a new column added in the MapFunction. The thought behind adding the column (withColumn) when creating the initial dataset is to make sure that a schema definition exists for the column I'd like to update in the MapFunction. However, I can't seem to find a way to modify the Row object passed to the call(Row arg) function of the MapFunction class, or to create a new instance using RowFactory.create(...) in the call function. I'd like to be able to create a Row instance in the MapFunction based on all the existing values of the passed Row object, plus a new Map to be added to the new row. The encoder would then know about this new/generated column from the generated schema. I hope this clarifies what I'm trying to accomplish...
You can
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
df.withColumn("newColumn", lit(null).cast("map<string, string>"));
or
df.withColumn(
"newColumn",
lit(null).cast(
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
)
);
but why such indirection? If you only need the column for the schema, you can extend the schema directly when building the encoder:
Encoder<Row> enc = RowEncoder.apply(df.schema().add(
"newColumn",
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)
));
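With that encoder in hand, the MapFunction can build a new Row from all the values of the incoming Row plus the new Map, exactly as asked for in the clarification. A minimal sketch, assuming `df` is the initial Dataset<Row>; the map contents here are purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Extend the existing schema with the map column, then build an encoder for it.
StructType newSchema = df.schema().add(
    "newColumn",
    DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));
Encoder<Row> enc = RowEncoder.apply(newSchema);

Dataset<Row> df2 = df.map((MapFunction<Row, Row>) row -> {
    // Copy all existing values and append the new map as the last field,
    // matching the order in which the field was added to the schema.
    Object[] values = new Object[row.size() + 1];
    for (int i = 0; i < row.size(); i++) {
        values[i] = row.get(i);
    }
    Map<String, String> m = new HashMap<>();
    m.put("someKey", "someValue"); // illustrative content
    values[row.size()] = m;
    return RowFactory.create(values);
}, enc);
```

The only contract to respect is that the Object[] passed to RowFactory.create matches the encoder's schema field-for-field, in order.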
Depending on what exactly you are trying to do, using a UserDefinedFunction might be much simpler, and allows you to skip Encoders completely.
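For example, a sketch of the UDF approach; the source column "id" and the map contents are hypothetical, not taken from the question:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.function.UDF1;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// A UDF that builds a map<string,string> from an existing string column.
// The return DataType must be declared explicitly for Java UDFs.
UserDefinedFunction toMap = udf(
    (UDF1<String, Map<String, String>>) id -> {
        Map<String, String> m = new HashMap<>();
        m.put("id", id); // illustrative key/value
        return m;
    },
    DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

Dataset<Row> df2 = df.withColumn("newColumn", toMap.apply(col("id")));
```

Because withColumn handles the schema change itself, no RowEncoder is needed here.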