
Pyspark structured streaming window (moving average) over last N data points

I read several data frames from Kafka topics using Pyspark Structured Streaming 2.4.4. I would like to add new columns to those data frames, based mainly on window calculations over the past N data points (for instance: a moving average over the last 20 data points), and as each new data point arrives, the corresponding value of MA_20 should be calculated immediately.

Data may look like this:

Timestamp           | VIX

2020-01-22 10:20:32 | 13.05
2020-01-22 10:25:31 | 14.35
2020-01-23 09:00:20 | 14.12

It is worth mentioning that data is received Monday to Friday, over an 8-hour period each day. Thus a moving average calculated on Monday morning should include data from Friday!

I tried different approaches, but I am still not able to achieve what I want.

windows = df_vix \
    .withWatermark("Timestamp", "100 minutes") \
    .groupBy(F.window("Timestamp", "100 minutes", "5 minutes"))

aggregatedDF = windows.agg(F.avg("VIX"))

The preceding code calculates a moving average, but it treats Friday's data as late, so it gets excluded. Rather than the last 100 minutes, it should be the last 20 data points (at 5-minute intervals).

I thought I could use rowsBetween or rangeBetween, but on streaming data frames a window cannot be applied over non-timestamp columns (F.col('Timestamp').cast('long')):

    w = Window.orderBy(F.col('Timestamp').cast('long')).rowsBetween(-600, 0)

    df = df_vix.withColumn('MA_20', F.avg('VIX').over(w))

On the other hand, there is no way to specify a time interval within rowsBetween(); using rowsBetween(-minutes(20), 0) throws an error because minutes is not defined (there is no such function in sql.functions).

I found another way, but it doesn't work for streaming data frames either. I don't know why the 'Non-time-based windows are not supported on streaming DataFrames' error is raised (df_vix.Timestamp is of timestamp type):

df_vix.createOrReplaceTempView("df_vix")
aggregatedDF = spark.sql(
    """SELECT *, mean(VIX) OVER (
        ORDER BY CAST(df_vix.Timestamp AS timestamp)
        RANGE BETWEEN INTERVAL 100 MINUTES PRECEDING AND CURRENT ROW
     ) AS mean FROM df_vix""")

I have no idea what else I could use to calculate a simple moving average. It looks like it is impossible to achieve in Pyspark... Maybe a better solution would be to convert the entire Spark data frame to Pandas each time new data arrives and calculate everything in Pandas (or append new rows to a Pandas frame and calculate the MA)?
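The pandas fallback considered here is straightforward once a micro-batch (for example, inside foreachBatch) has been converted with toPandas(): a row-count rolling window gives the last-N-points average and naturally spans the Friday-to-Monday gap. A minimal sketch with the sample values from the question (the min_periods=1 choice, which emits a value from the first row on, is an assumption):

```python
import pandas as pd

# Sample data taken from the question
vix = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-22 10:20:32", "2020-01-22 10:25:31", "2020-01-23 09:00:20",
    ]),
    "VIX": [13.05, 14.35, 14.12],
})

# Last-20-points moving average: a row-count window ignores gaps in time,
# so Monday-morning rows still average in Friday's points
vix["MA_20"] = vix["VIX"].rolling(window=20, min_periods=1).mean()
```

A row-based window is exactly what rowsBetween(-19, 0) would express in Spark, which is why the pandas detour is tempting when the streaming engine rejects non-time-based windows.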

I thought that creating new features as new data arrives was the main purpose of Structured Streaming, but as it turns out Pyspark is not suited to this, so I am considering giving up on Pyspark and moving to Pandas...

EDIT

The following doesn't work either: although df_vix.Timestamp is of type 'timestamp', it still throws the 'Non-time-based windows are not supported on streaming DataFrames' error.

w = Window.orderBy(df_vix.Timestamp).rowsBetween(-20, -1)
aggregatedDF = df_vix.withColumn("MA", F.avg("VIX").over(w))

Have you looked at window operations over event time? window(timestamp, "10 minutes", "5 minutes") will give you a dataframe of 10-minute windows every 5 minutes that you can then run aggregations on, including moving averages.
