Streaming data into a Delta table and keeping the latest values

Question

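I am streaming some temperature data from Azure Event Hubs into Databricks and want to store the latest values in a Delta table. For each sensor I take the maximum temperature over the last five minutes. I seem to have hit a blocker with the Delta table "upsert". Data for each device arrives every 10-15 seconds. I am not sure whether I am using writeStream correctly, or whether I need a window function over the data frame to insert the latest aggregated value.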

So far I have put together a basic example in PySpark to see whether it can be done:

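# Imports needed by the snippets below (note that round and max shadow the Python built-ins)
from pyspark.sql.functions import from_unixtime, get_json_object, max, round, to_timestamp

# This sets up the data frame
df = spark.readStream.format("eventhubs").options(**ehConf).load().selectExpr("cast (body as string) as body")

# Rounds the event time to the nearest 5 minutes (round() rounds to nearest, not up)
df = df.select(
  get_json_object(df.body,'$.sensorId').alias('sensorId'),
  get_json_object(df.body,'$.count').alias('temp'),
  to_timestamp(from_unixtime(round(((get_json_object(df.body,'$.timestamp')/1000)/300))*300.0 ,"yyyy-MM-dd HH:mm:ss")).alias("roundedDatetime")
)

# Groups by the sensor id and rounded date; temp is cast to int before max() because
# get_json_object returns strings, which would otherwise be compared lexicographically
df = df.groupBy("sensorId", "roundedDatetime").agg(max(df["temp"].cast("int")).alias("temp"))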

Everything works up to this point, and I can see the data at the 5-minute aggregation level.
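
The rounding expression above can be checked by hand. The following is a minimal plain-Python sketch of the same arithmetic, not the Spark code itself: `round_to_bucket` is a hypothetical helper, the sample timestamps are made up, and UTC is assumed (Spark's `from_unixtime` formats in the session time zone). It uses `floor(x + 0.5)` to mirror the half-up behaviour of Spark's `round`.

```python
import math
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # five minutes


def round_to_bucket(timestamp_ms: int) -> datetime:
    """Round an epoch-millisecond timestamp to the nearest 5-minute boundary."""
    seconds = timestamp_ms / 1000
    # floor(x + 0.5) rounds half up for positive values
    bucket_index = math.floor(seconds / BUCKET_SECONDS + 0.5)
    return datetime.fromtimestamp(bucket_index * BUCKET_SECONDS, tz=timezone.utc)


early = round_to_bucket(1605045749000)  # 2020-11-10 22:02:29 UTC
late = round_to_bucket(1605045767000)   # 2020-11-10 22:02:47 UTC
print(early.isoformat())  # 2020-11-10T22:00:00+00:00
print(late.isoformat())   # 2020-11-10T22:05:00+00:00
```

Two readings taken 18 seconds apart land in different buckets, so events in the second half of an interval are attributed to the next five-minute mark rather than the one they fall in.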

# Should trigger a micro-batch every five minutes (upsertToDelta is defined below)
query = (df.writeStream.format("delta").trigger(processingTime="5 minutes").foreachBatch(upsertToDelta).outputMode("update").start())


# This is my basic batch function, taken from the example in the streaming docs

def upsertToDelta(microbatchdf, batchId):
  microbatchdf.createOrReplaceTempView("updates")
  
  microbatchdf._jdf.sparkSession().sql("""
    MERGE INTO latestSensorReadings t
    USING updates s
    ON s.sensorId = t.sensorId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
  """)

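So the first time through, against an empty Delta table, it works fine; five minutes later I get a merge conflict because it is trying to insert the same values again. Is it trying to upsert the whole data frame rather than just the latest items?

I have looked at sliding windows that group on event time, but that did not seem to work. I am thinking of adding a window function inside the micro-batch function so that, when there is more than one item per sensor (for example rounded values at 10:00am and 10:05am), it only upserts the latest one, i.e. the 10:05 one. Any recommendations? Perhaps I am not triggering it quite right? I have tried moving the trigger down to one minute and back up, but no joy.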
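
The "keep only the latest item per sensor" idea can be sketched in plain Python to show the logic. This is an illustrative model, not the poster's code: `latest_per_sensor` and the sample rows are hypothetical. As far as I understand, Delta Lake's MERGE fails when more than one source row matches the same target row, which is what happens when a micro-batch carries both the 10:00 and the 10:05 row for a sensor and the join condition is `sensorId` alone. In PySpark the same reduction is commonly written with `row_number()` over `Window.partitionBy("sensorId").orderBy(desc("roundedDatetime"))`, keeping row number 1.

```python
from datetime import datetime


def latest_per_sensor(rows):
    """Keep only the row with the greatest roundedDatetime for each sensorId."""
    latest = {}
    for row in rows:
        current = latest.get(row["sensorId"])
        if current is None or row["roundedDatetime"] > current["roundedDatetime"]:
            latest[row["sensorId"]] = row
    # Sort only to make the result order deterministic
    return sorted(latest.values(), key=lambda r: r["sensorId"])


# Hypothetical micro-batch: sensor "a" appears in two five-minute buckets
micro_batch = [
    {"sensorId": "a", "roundedDatetime": datetime(2020, 11, 10, 10, 0), "temp": 21},
    {"sensorId": "a", "roundedDatetime": datetime(2020, 11, 10, 10, 5), "temp": 23},
    {"sensorId": "b", "roundedDatetime": datetime(2020, 11, 10, 10, 5), "temp": 19},
]

deduplicated = latest_per_sensor(micro_batch)
for row in deduplicated:
    print(row["sensorId"], row["roundedDatetime"].time(), row["temp"])
```

After this step each `sensorId` appears at most once, so a merge keyed on `sensorId` has exactly one candidate source row per target row.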

Answer 1

I think you forgot to give t as an alias for your latestSensorReadings table. Could you try:

    MERGE INTO latestSensorReadings t
    USING updates s
    ON s.sensorId = t.sensorId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *

Answer 2

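So, after looking around and digging deeper into streaming, it looks like I was doing it wrong. To get it to do what I wanted, I needed to remove the grouping from the stream and do it inside the batch function instead.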

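# Imports needed by the snippets in this answer (max shadows the Python built-in)
from pyspark.sql.functions import current_timestamp, get_json_object, max

# This sets up the data frame
df = spark.readStream.format("eventhubs").options(**ehConf).load().selectExpr("cast (body as string) as body")

# Get the details from the Event Hubs message body
df = df.select(
  get_json_object(df.body,'$.sensorId').alias('sensorId'),
  get_json_object(df.body,'$.count').alias('temp'))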

So I only take the detail rows and then process that batch every 5 minutes. My batch function now looks like:

# Triggers a micro-batch on a fixed interval (1 minute as written; use "5 minutes" for five-minute batches)
query = (df.writeStream.format("delta").trigger(processingTime="1 minutes").foreachBatch(upsertToDelta).outputMode("update").start())


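# This is my updated batch function, now doing the grouping

def upsertToDelta(microbatchdf, batchId):

  # Cast to int before max(): get_json_object returns strings, and a string max would rank "9" above "10"
  microbatchdf = microbatchdf.groupBy("sensorId").agg(max(microbatchdf["temp"].cast("int")).alias("temp"))
  # Stamp each row with the time the batch was processed
  microbatchdf = microbatchdf.withColumn("latestDatetime", current_timestamp())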

  microbatchdf.createOrReplaceTempView("updates")
  
  microbatchdf._jdf.sparkSession().sql("""
    MERGE INTO latestSensorReadings t
    USING updates s
    ON s.sensorId = t.sensorId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
  """)

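What the batch function in Answer 2 does to the table can be modelled in a few lines of plain Python. This is only a sketch of the semantics under stated assumptions: `aggregate_batch` and `upsert` are hypothetical helpers, a dict stands in for the `latestSensorReadings` table, and the batch labels stand in for `current_timestamp()`. Values arrive as strings, as they do from `get_json_object`, so the sketch casts to int before comparing.

```python
def aggregate_batch(events):
    """Maximum temperature per sensor, casting to int before comparing."""
    result = {}
    for event in events:
        sensor = event["sensorId"]
        temp = int(event["temp"])
        if sensor not in result or temp > result[sensor]:
            result[sensor] = temp
    return result


def upsert(table, updates, batch_time):
    """Overwrite rows for matching sensors and insert rows for new ones."""
    for sensor, temp in updates.items():
        table[sensor] = {"temp": temp, "latestDatetime": batch_time}
    return table


table = {}  # stands in for the Delta table, keyed by sensorId

first_batch = [
    {"sensorId": "a", "temp": "9"},
    {"sensorId": "a", "temp": "10"},
    {"sensorId": "b", "temp": "18"},
]
upsert(table, aggregate_batch(first_batch), "batch-1")

second_batch = [{"sensorId": "a", "temp": "7"}]
upsert(table, aggregate_batch(second_batch), "batch-2")

print(table["a"])  # {'temp': 7, 'latestDatetime': 'batch-2'}
print(table["b"])  # {'temp': 18, 'latestDatetime': 'batch-1'}
```

Because the grouping happens per micro-batch, each sensor contributes exactly one row to the merge, and the table holds the maximum from the most recent batch in which that sensor reported, not a running maximum over all batches.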

Answer 1
0 2020-11-10 22:02:47

Answer 2
0 Accepted 2020-11-11 09:49:30