How to convert a struct field in a Row to an Avro record in Spark Java

I have a use case where I want to convert a struct field to an Avro record. The input data consists of Avro files, and the struct field corresponds to a field of the input Avro records, so it originally maps to an Avro type.

Below is what I want to achieve, in pseudocode.

Dataset<Row> data = loadInput(); // data is of form (foo, bar, myStruct) from avro data.

// do some joins to add more data
data = doJoins(data); // now data is of form (a, b, myStruct)

// transform Dataset<Row> to Dataset<MyType>
// (the MapFunction cast resolves the overload ambiguity of Dataset.map for Java lambdas)
Dataset<MyType> myData = data.map((MapFunction<Row, MyType>) row -> myUDF(row), encoderOfMyType);

// method `myUDF` definition
MyType myUDF(Row row) {
  String a = row.getAs("a");
  String b = row.getAs("b");

  // MyStruct is the generated avro class that corresponds to field myStruct 
  MyStruct myStruct = convertToAvro(row.getAs("myStruct"));

  return generateMyType(a, b, myStruct);
}

My question is: how can I implement the convertToAvro method in the pseudocode above?

From the documentation:

The Avro package provides function to_avro to encode a column as binary in Avro format, and from_avro() to decode Avro binary data into a column. Both functions transform one column to another column, and the input/output SQL data type can be a complex type or a primitive type.

The function to_avro can take the place of the convertToAvro method:

import static org.apache.spark.sql.avro.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Put the Avro schema of the struct column into a string.
// This example assumes that the struct consists of two fields:
// a long field (s1) and a string field (s2).
String schema = "{\"type\":\"record\",\"name\":\"mystruct\"," +
        "\"namespace\":\"topLevelRecord\",\"fields\":[{\"name\":\"s1\"," +
        "\"type\":[\"long\",\"null\"]},{\"name\":\"s2\",\"type\":" +
        "[\"string\",\"null\"]}]}";

data = ...

// Add a column containing the struct encoded as Avro binary.
Dataset<Row> data2 = data.withColumn("to_avro", to_avro(data.col("myStruct"), schema));
data2.printSchema();
data2.show(false);

prints

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- mystruct: struct (nullable = true)
 |    |-- s1: long (nullable = true)
 |    |-- s2: string (nullable = true)
 |-- to_avro: binary (nullable = true)

+----+----+----------+----------------------------+
|a   |b   |mystruct  |to_avro                     |
+----+----+----------+----------------------------+
|foo1|bar1|[1, one]  |[00 02 00 06 6F 6E 65]      |
|foo2|bar2|[3, three]|[00 06 00 0A 74 68 72 65 65]|
+----+----+----------+----------------------------+
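The bytes in the to_avro column follow the Avro binary encoding: each union field starts with the index of the chosen branch, longs are written as zigzag varints, and strings as a length followed by UTF-8 bytes. As a rough illustration only, the following sketch decodes the two values shown above by hand, using nothing but the JDK. The class and method names are hypothetical, and the decoder is hard-wired to the two-field schema assumed in this answer; real code would use the Avro library instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AvroBytesByHand {

    // Reads one Avro int/long: a base-128 varint (low group first), then zigzag-decoded.
    static long readLong(ByteBuffer buf) {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            b = buf.get() & 0xFF;
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Reads an Avro string: a long holding the byte length, then that many UTF-8 bytes.
    static String readString(ByteBuffer buf) {
        byte[] utf8 = new byte[(int) readLong(buf)];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Decodes a record with fields s1: ["long","null"] and s2: ["string","null"].
    // Union branch 0 is the value type, branch 1 is null.
    static String decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        Long s1 = readLong(buf) == 0 ? Long.valueOf(readLong(buf)) : null;
        String s2 = readLong(buf) == 0 ? readString(buf) : null;
        return "[" + s1 + ", " + s2 + "]";
    }

    public static void main(String[] args) {
        byte[] row1 = {0x00, 0x02, 0x00, 0x06, 0x6F, 0x6E, 0x65};
        byte[] row2 = {0x00, 0x06, 0x00, 0x0A, 0x74, 0x68, 0x72, 0x65, 0x65};
        System.out.println(decode(row1)); // prints [1, one]
        System.out.println(decode(row2)); // prints [3, three]
    }
}
```

Note that the first byte is already the union index of s1, so the struct itself is written as a plain record and not as a union with null.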

To convert the Avro column back into a struct, the function from_avro can be used:

Dataset<Row> data3 = data2.withColumn("from_avro", from_avro(data2.col("to_avro"), schema));
data3.printSchema();
data3.show();

Output:

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- mystruct: struct (nullable = true)
 |    |-- s1: long (nullable = true)
 |    |-- s2: string (nullable = true)
 |-- to_avro: binary (nullable = true)
 |-- from_avro: struct (nullable = true)
 |    |-- s1: long (nullable = true)
 |    |-- s2: string (nullable = true)

+----+----+----------+--------------------+----------+
|   a|   b|  mystruct|             to_avro| from_avro|
+----+----+----------+--------------------+----------+
|foo1|bar1|  [1, one]|[00 02 00 06 6F 6...|  [1, one]|
|foo2|bar2|[3, three]|[00 06 00 0A 74 6...|[3, three]|
+----+----+----------+--------------------+----------+

A word about the udf: in the question, the conversion to the Avro format happens inside the udf. I would prefer to keep only the actual business logic in the udf and do the format conversion outside of it, so that the two concerns stay separate. If necessary, you can drop the original column mystruct after creating the Avro column.
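If an Avro object is still needed inside the map function, one option is to decode the binary to_avro value with Avro's own reader. The following is a minimal sketch, assuming the Avro library is on the classpath and the two-field schema used in this answer. The class name is hypothetical, and it returns a GenericRecord because the generated MyStruct class from the question is not available here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroBinaryToRecord {

    // Assumed schema: the same two-field record that was passed to to_avro.
    static final String SCHEMA_JSON = "{\"type\":\"record\",\"name\":\"mystruct\","
            + "\"namespace\":\"topLevelRecord\",\"fields\":["
            + "{\"name\":\"s1\",\"type\":[\"long\",\"null\"]},"
            + "{\"name\":\"s2\",\"type\":[\"string\",\"null\"]}]}";

    // Decodes Avro binary (as produced for the to_avro column) into a generic record.
    static GenericRecord convertToAvro(byte[] avroBytes) {
        try {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
            return reader.read(null, decoder);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The bytes shown for the first row of the to_avro column.
        byte[] row1 = {0x00, 0x02, 0x00, 0x06, 0x6F, 0x6E, 0x65};
        GenericRecord rec = convertToAvro(row1);
        System.out.println(rec.get("s1") + " " + rec.get("s2"));
    }
}
```

Inside the map function the bytes would presumably come from something like row.<byte[]>getAs("to_avro"). With a generated class, a SpecificDatumReader<MyStruct> built from MyStruct.class could replace the generic reader and return MyStruct directly. Parsing the schema once and reusing the reader, rather than rebuilding both per row as this sketch does, would likely be cheaper.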
