
Spark dataset: Casting Columns of dataset

This is my dataset:

  Dataset<Row> myResult = pot.select(col("number")
                    , col("document")
                    , explode(col("mask")).as("mask"));

I now need to create a new dataset from the existing myResult, something like below:

  Dataset<Row> myResultNew = myResult.select(col("number")
                , col("name")
                , col("age")
                , col("class")
                , col("mask"));

name, age and class are created from the document column of the Dataset myResult. I guess I can call a function on the document column and then perform any operation on it.

myResult.select(extract(col("document")));


 private String extract(final Column document) {
        // TODO: ADD NEW COLUMNS name, age, class TO THE NEW DATASET.
        // PARSE THE DOCUMENT AND GET THEM.

        XMLParser doc = (XMLParser) document; // this doesn't work???????
 }

My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? The document is xml and I need to parse it to get the other 3 columns, so I can't avoid converting it.

Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the value of one or more columns and execute any logic with this input.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

[...]

UserDefinedFunction extract = udf(
        (String document) -> {
            List<String> result = new ArrayList<>();
            XMLParser doc = XMLParser.parse(document);
            String name = ... //read name from xml document
            String age = ... //read age from xml document
            String clazz = ... //read class from xml document
            result.add(name);
            result.add(age);
            result.add(clazz);
            return result;
         }, DataTypes.createArrayType(DataTypes.StringType)
);
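
The XMLParser in the question is a placeholder, so here is a hedged, self-contained version of the same UDF using the JDK's built-in DOM parser. The element names name, age and class are assumptions about the layout of the xml documents:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

[...]

UserDefinedFunction extract = udf(
        (String document) -> {
            // parse the xml string into a DOM tree
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(document)));
            List<String> result = new ArrayList<>();
            // assumed layout: <name>, <age> and <class> elements in the document
            result.add(doc.getElementsByTagName("name").item(0).getTextContent());
            result.add(doc.getElementsByTagName("age").item(0).getTextContent());
            result.add(doc.getElementsByTagName("class").item(0).getTextContent());
            return result;
        }, DataTypes.createArrayType(DataTypes.StringType)
);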

A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.

Dataset<Row> myResultNew = myResult
    .withColumn("extract", extract.apply(col("document"))) //1
    .withColumn("name", col("extract").getItem(0))         //2
    .withColumn("age", col("extract").getItem(1))          //2
    .withColumn("class", col("extract").getItem(2))        //2
    .drop("document", "extract");                          //3
  1. call the UDF and use the column that contains the xml document as the parameter of the apply function
  2. create the result columns out of the array returned in step 1
  3. drop the intermediate columns
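
As an aside, "one column" does not have to mean one flat value: the UDF's return type can also be declared as a struct, which keeps the field names together and lets you unpack by name instead of by position. A hedged sketch, reusing the DOM parsing assumptions from above:

import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.StructType;

[...]

StructType personSchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.StringType)
        .add("class", DataTypes.StringType);

UserDefinedFunction extractStruct = udf(
        (String document) -> {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(document)));
            // a Row is converted to a struct when the declared return type is a StructType
            return RowFactory.create(
                    doc.getElementsByTagName("name").item(0).getTextContent(),
                    doc.getElementsByTagName("age").item(0).getTextContent(),
                    doc.getElementsByTagName("class").item(0).getTextContent());
        }, personSchema
);

Dataset<Row> myResultNew = myResult
    .withColumn("extract", extractStruct.apply(col("document")))
    .withColumn("name", col("extract").getField("name"))
    .withColumn("age", col("extract").getField("age"))
    .withColumn("class", col("extract").getField("class"))
    .drop("document", "extract");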

Note: the udf is executed once per row in the dataset. If the creation of the xml parser is expensive, this might slow down the execution of the Spark job, as one parser is instantiated per row. Due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions, as sketched below. There one needs not one parser per row but only one parser per partition of the dataset.
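
A minimal mapPartitions sketch, under the same assumptions about the xml layout. The result schema types for number and mask are guesses (the question does not state them), and the parser is created once per partition instead of once per row:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

[...]

// assumed result schema; the real types of number and mask may differ
StructType schema = new StructType()
        .add("number", DataTypes.StringType)
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.StringType)
        .add("class", DataTypes.StringType)
        .add("mask", DataTypes.StringType);

Dataset<Row> myResultNew = myResult.mapPartitions(
        (MapPartitionsFunction<Row, Row>) rows -> {
            // one parser per partition instead of one per row
            DocumentBuilder parser =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            List<Row> out = new ArrayList<>();
            while (rows.hasNext()) {
                Row row = rows.next();
                parser.reset(); // required before reusing a DocumentBuilder
                Document doc = parser.parse(new InputSource(
                        new StringReader(row.<String>getAs("document"))));
                out.add(RowFactory.create(
                        row.getAs("number"),
                        doc.getElementsByTagName("name").item(0).getTextContent(),
                        doc.getElementsByTagName("age").item(0).getTextContent(),
                        doc.getElementsByTagName("class").item(0).getTextContent(),
                        row.getAs("mask")));
            }
            return out.iterator();
        },
        RowEncoder.apply(schema)); // Encoders.row(schema) on newer Spark versions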

A completely different approach would be to use spark-xml.
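
For completeness, a minimal sketch of that route. It assumes the xml lives in files rather than in a column, that <person> is the element wrapping one record, and that the com.databricks:spark-xml package is on the classpath (e.g. via --packages):

// "spark" is an existing SparkSession; rowTag and the path are assumptions
Dataset<Row> persons = spark.read()
        .format("xml")
        .option("rowTag", "person")
        .load("persons.xml");

// name, age and class now arrive as ordinary columns with an inferred schema
persons.select(col("name"), col("age"), col("class")).show();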
