Spark Structured Streaming - Event processing with Window operations in stateful stream processing

Question

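I am new to Spark Structured Streaming and am currently working on a use case in which a Structured Streaming application receives events from Azure IoT Hub / Event Hubs (say, every 20 seconds).

The task is to consume these events and process them in real time. For this I have written the Spark Structured Streaming program below in Spark-Java.

Here are the important points: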

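Currently I have applied a window operation with a 10-minute window and a 5-minute slide.
A watermark of 10 minutes is applied on the eventDate column.
Currently I am not performing any other operation; I just store the result in Parquet format at the specified location.
The program stores one event per file.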

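Questions:

Is it possible to store multiple events in one Parquet file, based on the window time?
How does the window operation work in this case?
Also, I would like to check the state of an event against the previous event and, based on some calculation (say, no event was received in the last 5 minutes), update the state.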

...

public class EventSubscriber {

   public static void main(String args[]) throws InterruptedException, StreamingQueryException {

    String eventHubCompatibleEndpoint = "<My-EVENT HUB END POINT CONNECTION STRING>";

    String connString = new ConnectionStringBuilder(eventHubCompatibleEndpoint).build();

    EventHubsConf eventHubsConf = new EventHubsConf(connString).setConsumerGroup("$Default")
            .setStartingPosition(EventPosition.fromEndOfStream()).setMaxRatePerPartition(100)
            .setReceiverTimeout(java.time.Duration.ofMinutes(10));

    SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("IoT Spark Streaming");

    SparkSession spSession = SparkSession.builder()
            .appName("IoT Spark Streaming")
            .config(sparkConf).config("spark.sql.streaming.checkpointLocation", "<MY-CHECKPOINT-LOCATION>")
            .getOrCreate();

    Dataset<Row> inputStreamDF = spSession.readStream()
            .format("eventhubs")
            .options(eventHubsConf.toMap())
            .load();

    Dataset<Row> bodyRow = inputStreamDF.withColumn("body", new Column("body").cast(DataTypes.StringType)).select("body");

    StructType jsonStruct = new StructType()
            .add("eventType", DataTypes.StringType)
            .add("payload", DataTypes.StringType);

    Dataset<Row> messageRow = bodyRow.map((MapFunction<Row, Row>) value -> {
        // getString already returns a String, so no toString() call is needed.
        String valStr = value.getString(0);

        String payload = valStr;

        Gson gson = new GsonBuilder().serializeNulls().setPrettyPrinting().create();

        JsonObject jsonObj = gson.fromJson(valStr, JsonObject.class);

        JsonElement methodName = jsonObj.get("method");

        String eventType = null;
        if(methodName != null) {
            eventType = "OTHER_EVENT";
        } else {
            eventType = "DEVICE_EVENT";
        }

        Row jsonRow = RowFactory.create(eventType, payload);
        return jsonRow;

    }, RowEncoder.apply(jsonStruct));

    messageRow.printSchema();

    Dataset<Row> deviceEventRowDS = messageRow.filter("eventType = 'DEVICE_EVENT'");

    deviceEventRowDS.printSchema();

    Dataset<DeviceEvent> deviceEventDS = deviceEventRowDS.map((MapFunction<Row, DeviceEvent>) value -> {

        // Column 1 is the raw JSON payload; getString already returns a String.
        String jsonString = value.getString(1);

        Gson gson = new GsonBuilder().serializeNulls().setPrettyPrinting().create();

        DeviceMessage deviceMessage = gson.fromJson(jsonString, DeviceMessage.class);
        DeviceEvent deviceEvent = deviceMessage.getDeviceEvent();
        return deviceEvent;

    }, Encoders.bean(DeviceEvent.class));

    deviceEventDS.printSchema();

    Dataset<Row> messageDataset = deviceEventDS.select(
            functions.col("eventType"),
            functions.col("deviceID"),
            functions.col("description"),
            // Assumes eventDate uses a 24-hour clock: "HH" is the 24-hour field,
            // whereas "hh" is the 12-hour field and expects an am/pm marker.
            functions.to_timestamp(functions.col("eventDate"), "yyyy-MM-dd HH:mm:ss").as("eventDate"),
            functions.col("deviceModel"),
            functions.col("pingRate"));

    messageDataset.printSchema();

    Dataset<Row> devWindowDataset = messageDataset.withWatermark("eventDate", "10 minutes")
            .groupBy(functions.col("deviceID"),
                    functions.window(
                            functions.col("eventDate"), "10 minutes", "5 minutes"))
            .count();

    devWindowDataset.printSchema();

    // "truncate" is a console-sink display option and has no effect on the Parquet sink, so it is dropped.
    StreamingQuery query = devWindowDataset.writeStream().outputMode("append")
            .format("parquet")
            .option("path", "<MY-PARQUET-FILE-OUTPUT-LOCATION>")
            .start();

    query.awaitTermination();
   }
}

...

Any help or guidance on this would be useful.

Thanks and regards,

Avinash Deshmukh

Answer 1

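Is it possible to store multiple events in one Parquet file, based on the window time?

Yes.

How does the window operation work in this case?

The following code is the main part of the Spark Structured Streaming application: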

Dataset<Row> devWindowDataset = messageDataset
  .withWatermark("eventDate", "10 minutes")
  .groupBy(
    functions.col("deviceID"),
    functions.window(functions.col("eventDate"), "10 minutes", "5 minutes"))
  .count();

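That says that the underlying state store should keep state per deviceID and eventDate window for 10 minutes, plus an extra 10 minutes for late events (per withWatermark). In other words, you should see results coming out once an event arrives whose eventDate is 20 minutes past the start of the streaming query.

withWatermark exists to handle late events, so even when groupBy has already produced a result for a window, that result is emitted only once the watermark threshold has been crossed.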
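
To make that emission rule concrete, here is a minimal plain-Java sketch (no Spark dependency) of the watermark bookkeeping. It assumes a simplified model: the watermark is the latest event time seen so far minus the 10-minute delay, and a window's result becomes final once the watermark reaches the window's end. WatermarkSketch and its methods are hypothetical names for illustration, not Spark APIs, and the sketch ignores that Spark only advances the watermark between micro-batches.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model of append-mode watermarking; not part of Spark. */
class WatermarkSketch {
    private final Duration delay;
    private Instant maxEventTime; // null until the first event is observed

    WatermarkSketch(Duration delay) {
        this.delay = delay;
    }

    /** Track the latest event time seen so far. */
    void observe(Instant eventTime) {
        if (maxEventTime == null || eventTime.isAfter(maxEventTime)) {
            maxEventTime = eventTime;
        }
    }

    /** Watermark = latest event time minus the allowed delay; null before any event. */
    Instant watermark() {
        return maxEventTime == null ? null : maxEventTime.minus(delay);
    }

    /** A window is final (safe to emit in append mode) once the watermark has reached its end. */
    boolean isFinal(Instant windowEnd) {
        Instant wm = watermark();
        return wm != null && !wm.isBefore(windowEnd);
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(Duration.ofMinutes(10));
        Instant windowEnd = Instant.parse("2019-10-21T12:10:00Z");
        w.observe(Instant.parse("2019-10-21T12:15:00Z"));
        System.out.println(w.isFinal(windowEnd)); // watermark is 12:05, window still open
        w.observe(Instant.parse("2019-10-21T12:20:00Z"));
        System.out.println(w.isFinal(windowEnd)); // watermark is 12:10, window can be emitted
    }
}
```

Under this model, a window ending at 12:10 is written out only after some event with eventDate 12:20 or later has been seen, which is the "window + watermark" delay described above.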

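The same procedure applies to every 10-minute window (+ 10 minutes of watermark), with windows sliding every 5 minutes.

Think of groupBy with the window operator as a multi-column aggregation.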
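
As a rough illustration of why this behaves like a multi-column aggregation, the plain-Java sketch below (no Spark dependency) computes which sliding windows a single eventDate falls into. With a 10-minute window and a 5-minute slide, every event belongs to two overlapping windows, so it contributes to two (deviceID, window) groups. SlidingWindowSketch and windowStarts are hypothetical names, and the sketch assumes windows are half-open [start, start + size) intervals aligned to the Unix epoch with no start offset.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sliding-window assignment; not part of Spark. */
class SlidingWindowSketch {

    /** Start instants of every [start, start + size) window that contains eventTime. */
    static List<Instant> windowStarts(Instant eventTime, Duration size, Duration slide) {
        long t = eventTime.toEpochMilli();
        long sizeMs = size.toMillis();
        long slideMs = slide.toMillis();
        // Latest window start that is <= t, aligned to multiples of the slide.
        long lastStart = t - Math.floorMod(t, slideMs);
        List<Instant> starts = new ArrayList<>();
        // Walk backwards while the window [s, s + size) still covers t.
        for (long s = lastStart; s > t - sizeMs; s -= slideMs) {
            starts.add(Instant.ofEpochMilli(s));
        }
        return starts;
    }

    public static void main(String[] args) {
        Instant eventDate = Instant.parse("2019-10-21T12:07:00Z");
        for (Instant start : windowStarts(eventDate, Duration.ofMinutes(10), Duration.ofMinutes(5))) {
            System.out.println(start + " .. " + start.plus(Duration.ofMinutes(10)));
        }
    }
}
```

For an event at 12:07 this yields the windows starting at 12:05 and 12:00, so the count for each of those two windows (per deviceID) is incremented.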

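Also, I would like to check the state of an event against the previous event and, based on some calculation (say, no event was received in the last 5 minutes), update the state.

That sounds like a use case for the KeyValueGroupedDataset.flatMapGroupsWithState operator (aka Arbitrary Stateful Streaming Aggregation). Consult Arbitrary Stateful Operations.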
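
The per-device logic such an operator would carry can be sketched in plain Java (no Spark dependency). The sketch keeps the last eventDate seen per deviceID and reports a device as INACTIVE when it has been silent for longer than a timeout (5 minutes in the question's example). DeviceStateSketch, Status, onEvent and statusAt are hypothetical names for illustration only. In a real query this state would live in Spark's GroupState inside the function passed to flatMapGroupsWithState, with the silence check driven by a GroupStateTimeout rather than an explicit "now" argument.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-device state with an inactivity timeout; not part of Spark. */
class DeviceStateSketch {
    enum Status { ACTIVE, INACTIVE }

    private final Duration timeout;
    private final Map<String, Instant> lastSeen = new HashMap<>();

    DeviceStateSketch(Duration timeout) {
        this.timeout = timeout;
    }

    /** Record an event, keeping the latest event time per device (out-of-order events are tolerated). */
    void onEvent(String deviceId, Instant eventTime) {
        lastSeen.merge(deviceId, eventTime, (old, cur) -> cur.isAfter(old) ? cur : old);
    }

    /** Status as of 'now': INACTIVE if the device is unknown or silent for longer than the timeout. */
    Status statusAt(String deviceId, Instant now) {
        Instant seen = lastSeen.get(deviceId);
        if (seen == null) {
            return Status.INACTIVE;
        }
        return Duration.between(seen, now).compareTo(timeout) > 0 ? Status.INACTIVE : Status.ACTIVE;
    }

    public static void main(String[] args) {
        DeviceStateSketch state = new DeviceStateSketch(Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2019-10-21T12:00:00Z");
        state.onEvent("dev-1", t0);
        System.out.println(state.statusAt("dev-1", t0.plus(Duration.ofMinutes(4)))); // within the timeout
        System.out.println(state.statusAt("dev-1", t0.plus(Duration.ofMinutes(6)))); // silent for too long
    }
}
```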

It is also possible that you simply need one of the many standard aggregate functions or a user-defined aggregate function (UDAF).

Solution 1
0 2019-10-21 18:22:31