Insert missing elements in list as Rows per Time-Window group to DataFrame
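Trying to solve this programmatically, and it seems to be a difficult problem. Basically, if a sensor item was not captured in the source data for a given time-series timestamp interval, I want to append a row for each missing sensor item, with a NULL value, per timestamp window.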
# list of sensor items [have 300 plus; only showing 4 as example]
sensor_list = ["temp", "pressure", "vacuum", "burner"]
# sample data
df = spark.createDataFrame([('2019-05-10 7:30:05', 'temp', '99'),
                            ('2019-05-10 7:30:05', 'burner', 'TRUE'),
                            ('2019-05-10 7:30:10', 'vacuum', '.15'),
                            ('2019-05-10 7:30:10', 'burner', 'FALSE'),
                            ('2019-05-10 7:30:10', 'temp', '75'),
                            ('2019-05-10 7:30:15', 'temp', '77'),
                            ('2019-05-10 7:30:20', 'pressure', '.22'),
                            ('2019-05-10 7:30:20', 'temp', '101')], ["date", "item", "value"])
# current dilemma => not every sensor item is captured at each timestamp; the current back-end design of the streaming devices only captures updates to sensors
+------------------+--------+-----+
| date| item|value|
+------------------+--------+-----+
|2019-05-10 7:30:05| temp| 99|
|2019-05-10 7:30:05| burner| TRUE|
|2019-05-10 7:30:10| vacuum| .15|
|2019-05-10 7:30:10| burner|FALSE|
|2019-05-10 7:30:10| temp| 75|
|2019-05-10 7:30:15| temp| 77|
|2019-05-10 7:30:20|pressure| .22|
|2019-05-10 7:30:20| temp| 101|
+------------------+--------+-----+
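I want to capture every sensor item for every timestamp so that forward-fill imputation can be performed before pivoting the DataFrame [forward-filling 300-plus columns causes a Scala error => Spark Caused by: java.lang.StackOverflowError, window function?]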
# desired output
+------------------+--------+-----+
| date| item|value|
+------------------+--------+-----+
|2019-05-10 7:30:05| temp| 99|
|2019-05-10 7:30:05| burner| TRUE|
|2019-05-10 7:30:05| vacuum| NULL|
|2019-05-10 7:30:05|pressure| NULL|
|2019-05-10 7:30:10| vacuum| .15|
|2019-05-10 7:30:10| burner|FALSE|
|2019-05-10 7:30:10| temp| 75|
|2019-05-10 7:30:10|pressure| NULL|
|2019-05-10 7:30:15| temp| 77|
|2019-05-10 7:30:15|pressure| NULL|
|2019-05-10 7:30:15| burner| NULL|
|2019-05-10 7:30:15| vacuum| NULL|
|2019-05-10 7:30:20|pressure| .22|
|2019-05-10 7:30:20| temp| 101|
|2019-05-10 7:30:20| vacuum| NULL|
|2019-05-10 7:30:20| burner| NULL|
+------------------+--------+-----+
Expanding on my comment:
You can right join your DataFrame with the Cartesian product of the distinct dates and the sensor_list. Since sensor_list is small, you can broadcast it.
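from pyspark.sql.functions import broadcast
sensor_list = ["temp", "pressure", "vacuum", "burner"]
# one-column DataFrame of sensor names, small enough to broadcast
sensors = spark.createDataFrame([(x,) for x in sensor_list], ["item"])
# every (date, item) combination that should appear in the output
all_pairs = df.select("date").distinct().crossJoin(broadcast(sensors))
# the right join keeps every combination; missing readings get a null value
df.join(all_pairs, on=["date", "item"], how="right").sort("date", "item").show()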
#+------------------+--------+-----+
#| date| item|value|
#+------------------+--------+-----+
#|2019-05-10 7:30:05| burner| TRUE|
#|2019-05-10 7:30:05|pressure| null|
#|2019-05-10 7:30:05| temp| 99|
#|2019-05-10 7:30:05| vacuum| null|
#|2019-05-10 7:30:10| burner|FALSE|
#|2019-05-10 7:30:10|pressure| null|
#|2019-05-10 7:30:10| temp| 75|
#|2019-05-10 7:30:10| vacuum| .15|
#|2019-05-10 7:30:15| burner| null|
#|2019-05-10 7:30:15|pressure| null|
#|2019-05-10 7:30:15| temp| 77|
#|2019-05-10 7:30:15| vacuum| null|
#|2019-05-10 7:30:20| burner| null|
#|2019-05-10 7:30:20|pressure| .22|
#|2019-05-10 7:30:20| temp| 101|
#|2019-05-10 7:30:20| vacuum| null|
#+------------------+--------+-----+
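To make the logic of the join easier to follow without a Spark session, here is a minimal plain-Python sketch of the same idea: build the Cartesian product of distinct dates and sensor names, then look up each pair so unmatched ones come back as None. It also shows the per-item forward fill that the question wants to run afterwards. The helper names complete_grid and forward_fill are hypothetical and only for illustration; this is not Spark code and not a substitute for the join above.

```python
from itertools import product

# Sample rows from the question: (date, item, value).
rows = [
    ("2019-05-10 7:30:05", "temp", "99"),
    ("2019-05-10 7:30:05", "burner", "TRUE"),
    ("2019-05-10 7:30:10", "vacuum", ".15"),
    ("2019-05-10 7:30:10", "burner", "FALSE"),
    ("2019-05-10 7:30:10", "temp", "75"),
    ("2019-05-10 7:30:15", "temp", "77"),
    ("2019-05-10 7:30:20", "pressure", ".22"),
    ("2019-05-10 7:30:20", "temp", "101"),
]
sensor_list = ["temp", "pressure", "vacuum", "burner"]


def complete_grid(rows, items):
    """Return one (date, item, value) row per date/item pair, None where missing."""
    observed = {(date, item): value for date, item, value in rows}
    # Assumption: the date strings sort correctly as text, which holds for this
    # sample only. Real data should be parsed into timestamps first.
    dates = sorted({date for date, _, _ in rows})
    # Cartesian product of dates and items; the dict lookup plays the role of
    # the right join, so pairs with no reading get None.
    return [(d, i, observed.get((d, i))) for d, i in product(dates, sorted(items))]


def forward_fill(grid):
    """Carry the last non-None value forward within each item, in date order."""
    last_seen = {}
    filled = []
    for date, item, value in sorted(grid, key=lambda r: (r[0], r[1])):
        if value is not None:
            last_seen[item] = value
        # Items with no earlier reading stay None.
        filled.append((date, item, last_seen.get(item)))
    return filled


grid = complete_grid(rows, sensor_list)
filled = forward_fill(grid)
for row in filled:
    print(row)
```

Doing the fill on the long format, one item at a time, avoids touching 300-plus columns at once. In Spark the analogous step would presumably be a window partitioned by item and ordered by date, but that part is untested here.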