
How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.

Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference between that line's timestamp and the user's previous timestamp.

For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10", then I want to add a column in the second line called "Interval" saying "10 seconds".

Does anyone know how to achieve this? I tried the window function examples from the Structured Streaming documentation, but they didn't help.

Thank you very much.

Since we're talking about Structured Streaming and "every line and for every user", that tells me you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).

For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That means records for a single user can end up in two different micro-batches, which in turn means you need state.

All of that together means you need a stateful streaming aggregation.

With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):

Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.

Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.

The state would be kept per user and would hold the last record seen for that user. That looks doable.
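Note that mapGroupsWithState / flatMapGroupsWithState are only exposed in the Scala/Java API. If you are on PySpark 3.4 or later, the equivalent arbitrary stateful operation is applyInPandasWithState on a grouped DataFrame. Below is a minimal, untested sketch of that approach; the socket source, the column names user / ts, and the compute_intervals helper are assumptions made up for this example, not part of the question:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("user-intervals").getOrCreate()

# Placeholder streaming source: lines like "User A, 08:00:00" on a local socket.
# In practice this would be Kafka, files, etc.
events = (
    spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999).load()
    .selectExpr(
        "split(value, ', ')[0] AS user",
        "to_timestamp(split(value, ', ')[1], 'HH:mm:ss') AS ts",
    )
)

def compute_intervals(pdf, last_epoch):
    """Add interval_seconds to one micro-batch slice of a single user's rows."""
    pdf = pdf.sort_values("ts")
    epochs = pdf["ts"].map(pd.Timestamp.timestamp)   # seconds since epoch
    prev = epochs.shift(1)
    if last_epoch is not None and len(prev) > 0:
        prev.iloc[0] = last_epoch                    # bridge the micro-batch boundary
    pdf["interval_seconds"] = epochs - prev          # NaN for a user's very first event
    if len(epochs) > 0:
        last_epoch = float(epochs.iloc[-1])
    return pdf[["user", "ts", "interval_seconds"]], last_epoch

def add_interval(key, pdf_iter, state: GroupState):
    """Per-user stateful function: the state holds the last timestamp seen."""
    last_epoch = state.get[0] if state.exists else None
    for pdf in pdf_iter:
        out, last_epoch = compute_intervals(pdf, last_epoch)
        yield out
    if last_epoch is not None:
        state.update((last_epoch,))

with_intervals = events.groupBy("user").applyInPandasWithState(
    add_interval,
    outputStructType="user STRING, ts TIMESTAMP, interval_seconds DOUBLE",
    stateStructType="last_epoch DOUBLE",
    outputMode="append",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = with_intervals.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```

The state stores only the last timestamp (as epoch seconds) per user, so an interval can still be computed when consecutive events for the same user land in different micro-batches.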

My concerns would be:

  1. How many users is this streaming query going to deal with? (the more users, the bigger the state)

  2. When to clean up the state of users that are no longer expected in the stream? (this keeps the state at a reasonable size; see the timeout sketch after this list)
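On concern 2, the group-state API lets Spark expire a user's state automatically after an idle period. A hedged variation of the sketch above (it reuses the events DataFrame and the hypothetical compute_intervals helper defined there; the 30-minute value is only an example), switching from NoTimeout to a processing-time timeout:

```python
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

IDLE_MS = 30 * 60 * 1000  # example: drop a user's state after 30 idle minutes

def add_interval_with_ttl(key, pdf_iter, state: GroupState):
    if state.hasTimedOut:
        state.remove()        # no events for IDLE_MS: discard this user's state
        return                # nothing to emit for a timed-out group
    last_epoch = state.get[0] if state.exists else None
    for pdf in pdf_iter:
        out, last_epoch = compute_intervals(pdf, last_epoch)
        yield out
    if last_epoch is not None:
        state.update((last_epoch,))
        state.setTimeoutDuration(IDLE_MS)   # re-arm the timeout on every update

with_intervals = events.groupBy("user").applyInPandasWithState(
    add_interval_with_ttl,
    outputStructType="user STRING, ts TIMESTAMP, interval_seconds DOUBLE",
    stateStructType="last_epoch DOUBLE",
    outputMode="append",
    timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
)
```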
