How to implement the Function1/MapFunction interfaces in Spark 2.3 when the rows of a Dataset<Row> are mutated

What is the preferred way to implement a class based on the Function1/MapFunction interfaces in Spark 2.3 when the class changes the schema of individual rows? Ultimately, every row's schema might become different depending on the results of different look-ups.

Something like:

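import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;

// MapFunction is the interface that declares call(); scala.Function1 declares apply() instead.
public class XyzProcessor implements MapFunction<Row, Row> {
...
    @Override
    public Row call(Row row) throws Exception {
        // The schema of `row` will be changed here...
        return row;
    }
...
}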

The .map method of the Dataset will be called as:

ExpressionEncoder<Row> rowEncoder = RowEncoder.apply(foo.schema());
dataset.map(new XyzProcessor(), rowEncoder);

The "problem" is that XyzProcessor alters the schema by adding columns to the row, so the schema held by rowEncoder no longer matches the rows it has to encode. What is the preferred way to deal with this?

Is this the right way to accomplish Dataset modifications?

There is a conceptual error in your design:

"Ultimately every row's schema might become different depending on the result of different look-ups."

The schema in Spark SQL has to be fixed. In the worst-case scenario it can be a serialized BLOB, but it has to be consistent across all rows.

You have to go back to the drawing board and redesign your process. If the output is type compatible (there are no conflicts between (path, type) tuples), then making the remaining fields nullable should solve your problem.

If not, I would go with RDDs, which support proper type hierarchies.
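
The type-compatibility condition above can be illustrated with a small sketch in plain Java. It is not Spark API: SchemaMerger, unify, and pad are hypothetical names, and each row's schema is assumed to be modelled as a map from field path to a type name. The sketch merges the per-row schemas into one fixed schema, rejects any path that appears with two different types, and then fills the fields a row lacks with null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Spark.
final class SchemaMerger {

    private SchemaMerger() {}

    /**
     * Merges per-row schemas (field path -> type name) into one unified schema.
     * Throws IllegalArgumentException when a path occurs with two different types,
     * which is the (path, type) conflict that makes the rows incompatible.
     */
    static Map<String, String> unify(List<Map<String, String>> rowSchemas) {
        Map<String, String> unified = new LinkedHashMap<>();
        for (Map<String, String> schema : rowSchemas) {
            for (Map.Entry<String, String> field : schema.entrySet()) {
                String previous = unified.putIfAbsent(field.getKey(), field.getValue());
                if (previous != null && !previous.equals(field.getValue())) {
                    throw new IllegalArgumentException(
                        "Conflicting types for '" + field.getKey() + "': "
                            + previous + " vs " + field.getValue());
                }
            }
        }
        return unified;
    }

    /** Lays out one row's values in unified-schema order; absent fields become null. */
    static List<Object> pad(Map<String, String> unified, Map<String, Object> values) {
        List<Object> padded = new ArrayList<>(unified.size());
        for (String path : unified.keySet()) {
            padded.add(values.get(path));
        }
        return padded;
    }
}
```

In Spark terms, the unified map plays the role of the single output schema handed to the encoder, with every field that only some rows populate declared nullable. If unify throws, the rows are not type compatible and the RDD route is the fallback.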
