
Apache Spark SQL StructType together with UDF

Spark 1.6 / Java-7

Initial DataFrame with a new column:

// adding a new column for the UDF computation:
df = df.withColumn("TEMP_COLUMN", lit(null));

What is the correct way to write a UDF that creates a new StructType value and puts it into the cell?

public static DataFrame compute(SQLContext sqlContext, DataFrame df) {
    sqlContext.udf().register("compute", new MyUdf(), new ArrayType(new StructType(), true));
    return df.withColumn("TEMP_COLUMN", functions.callUDF("compute"));
}

class MyUdf implements UDF0<List<StructType>> {
    @Override
    public List<StructType> call() {
        ...
        return ? // what must be returned here? List<StructType> or List<String> or anything else?
    }
}


+-------------------------+
|TEMP_COLUMN              |
+-------------------------+
|[A[1, 2, 3], B[4, 5, 6]] |
+-------------------------+

I want a column that holds an array of elements, where each element has several fields.
I don't understand whether registering the UDF with the type new ArrayType(new StructType(), true) is correct, and whether the UDF's return type should then be List<StructType>.
How should the data be returned? Is it something like new StructType(new StructField[]{new StructField(...)})?

Answering my own question, since we were lucky enough to find out how to do it:

Let's say we need a 'complex' structure like this:

MapType CLIENTS_INFO_DATA_TYPE = DataTypes.createMapType(
  DataTypes.StringType,
  DataTypes.createStructType(
    new StructField[] {
        DataTypes.createStructField("NAME_1", DataTypes.DoubleType, false),
        DataTypes.createStructField("NAME_2", DataTypes.DoubleType, false),
        DataTypes.createStructField("NAME_3", DataTypes.DoubleType, false)
    }
  ),
  true
);


StructType COMPLEX_DATA_TYPE = DataTypes.createStructType(new StructField[] {
  DataTypes.createStructField("CLIENTS_INFO", CLIENTS_INFO_DATA_TYPE, true),
  DataTypes.createStructField("COMMENT", DataTypes.StringType, true)
});

And its schema:

dataFrame.printSchema()

|-- COMPLEX_DATA_TYPE: struct (nullable = true)
|    |-- CLIENTS_INFO: map (nullable = true)
|    |    |-- key: string
|    |    |-- value: struct (valueContainsNull = true)
|    |    |    |-- NAME_1: double (nullable = false)
|    |    |    |-- NAME_2: double (nullable = false)
|    |    |    |-- NAME_3: double (nullable = false)
|    |-- COMMENT: string (nullable = true)

Next we have to register the UDF that operates on our structure:

DataFrame compute(SQLContext sqlContext, DataFrame df) {
    sqlContext.udf().register(
        "computeUDF",
        new MyUDF(),
        COMPLEX_DATA_TYPE);

    return df.withColumn(
        "TEMP_FIELD_NAME",
        functions.callUDF("computeUDF", field_1.getColumn(), field_2.getColumn()));
}

The final step is the UDF itself, which returns a Row object (Spark converts it into our structure):

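import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.catalyst.expressions.GenericRow;

public final class MyUDF implements UDF2<Double, Double, Row> {
    @Override
    public Row call(Double value1, Double value2) {
        // Client name -> struct of (NAME_1, NAME_2, NAME_3), matching CLIENTS_INFO_DATA_TYPE.
        Map<String, Row> clientsInfoMap = new HashMap<>();
        ...
        for (Map.Entry<String, ClientInfo> clientInfoEntry : clientsInfo.entrySet()) {
            final String client = clientInfoEntry.getKey();
            final ClientInfo clientInfo = clientInfoEntry.getValue();

            // Values are positional: NAME_1, NAME_2, NAME_3.
            final Double[] clientInfoValues = {10.0, 20.0, 30.0};

            Row clientInfoRow = new GenericRow(clientInfoValues);
            clientsInfoMap.put(client, clientInfoRow);
        }

        // Values are positional again: CLIENTS_INFO first, then COMMENT.
        Object[] fullClientsInfo = new Object[] {clientsInfoMap, "string-as-a-comment"};
        return new GenericRow(fullClientsInfo);
    }
}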
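
One detail worth spelling out: a Row built this way carries no field names, so its values are matched to the struct's fields purely by position, in the order the fields were declared in the registered type. The helper below is a minimal, Spark-free sketch of one way to keep that order straight. RowValues and inFieldOrder are hypothetical names, not part of Spark, and the sketch assumes the field names would come from something like COMPLEX_DATA_TYPE.fieldNames(). It arranges values supplied by name into a positional array and fails fast when a field has no value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: arranges named values into the
// positional order in which a struct's fields were declared.
final class RowValues {

    private RowValues() {
    }

    // fieldNames: struct field names in declaration order.
    // byName: one entry per field; a null value is allowed, a missing key is not.
    static Object[] inFieldOrder(String[] fieldNames, Map<String, ?> byName) {
        Object[] values = new Object[fieldNames.length];
        for (int i = 0; i < fieldNames.length; i++) {
            String name = fieldNames[i];
            if (!byName.containsKey(name)) {
                throw new IllegalArgumentException("No value for field " + name);
            }
            values[i] = byName.get(name);
        }
        return values;
    }

    public static void main(String[] args) {
        // Field order of the inner struct used in the answer.
        String[] fieldNames = {"NAME_1", "NAME_2", "NAME_3"};

        // Insertion order is deliberately shuffled.
        Map<String, Object> byName = new HashMap<>();
        byName.put("NAME_3", 30.0);
        byName.put("NAME_1", 10.0);
        byName.put("NAME_2", 20.0);

        // Prints [10.0, 20.0, 30.0]
        System.out.println(Arrays.toString(inFieldOrder(fieldNames, byName)));
    }
}
```

The resulting array could then be passed to new GenericRow(values) as in the UDF above, or to the public factory method RowFactory.create(values).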

Now, since the column is a struct, we can select TEMP_FIELD_NAME.CLIENTS_INFO, or any other field, by name.
