
How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.

Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.

For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
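In plain Python terms, the per-user computation I'm after is just a timestamp difference (a sketch; the function name is illustrative):

```python
from datetime import datetime

def interval_seconds(prev_ts: str, curr_ts: str) -> float:
    """Seconds between two 'HH:MM:SS' timestamps."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(curr_ts, fmt) - datetime.strptime(prev_ts, fmt)).total_seconds()

print(interval_seconds("08:00:00", "08:00:10"))  # -> 10.0
```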

Does anyone know how to achieve this? I tried the window function examples from the Structured Streaming documentation, but they didn't help.

Thank you very much

Since we're talking about Structured Streaming and "every line and for every user", you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).

Streaming aggregations in Structured Streaming run only with micro-batch stream execution. That means records for a single user can end up in two different micro-batches, so you need state.

Putting it all together, you need a stateful streaming aggregation.
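To make the micro-batch argument concrete, here is a minimal plain-Python simulation (illustrative names, no Spark). User A's two records arrive in different micro-batches, so the interval can only be computed by carrying the last timestamp per user between batches:

```python
from datetime import datetime

FMT = "%H:%M:%S"

def process_batch(batch, state):
    """Compute per-user intervals for one micro-batch.
    `state` maps user -> last timestamp seen, carried across batches."""
    out = []
    for user, ts in batch:
        interval = None
        if user in state:
            interval = (datetime.strptime(ts, FMT) - datetime.strptime(state[user], FMT)).total_seconds()
        state[user] = ts  # carry the latest timestamp to the next micro-batch
        out.append((user, ts, interval))
    return out

state = {}
print(process_batch([("A", "08:00:00"), ("B", "08:00:05")], state))  # first records: no interval
print(process_batch([("A", "08:00:10")], state))  # A's interval: 10.0, spanning two batches
```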

With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):

Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.

Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState . Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.

The state would be per user, holding the last record seen. That looks doable.
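Note that mapGroupsWithState / flatMapGroupsWithState are Scala/Java APIs. Since the question is about PySpark, the closest counterpart is DataFrame.groupBy(...).applyInPandasWithState, available since Spark 3.4. A minimal sketch under that assumption (the column names `user` and `ts` are illustrative; adapt them to your schema):

```python
import pandas as pd

def with_intervals(key, pdfs, state):
    """For one user, add an `interval` column (seconds since that user's
    previous record), carrying the last timestamp across micro-batches."""
    last = state.get[0] if state.exists else None  # epoch seconds, or None
    for pdf in pdfs:
        pdf = pdf.sort_values("ts")
        secs = pdf["ts"].astype("int64") // 1_000_000_000  # ns -> epoch seconds
        prev = secs.shift(1).astype("float64")
        if last is not None:
            prev.iloc[0] = last          # first row continues from saved state
        pdf["interval"] = secs - prev    # NaN only for a user's very first record
        last = int(secs.iloc[-1])
        state.update((last,))            # persist for the next micro-batch
        yield pdf

def attach_intervals(events):
    # `events` is a streaming DataFrame with columns `user: string, ts: timestamp`.
    return events.groupBy("user").applyInPandasWithState(
        with_intervals,
        outputStructType="user string, ts timestamp, interval double",
        stateStructType="last long",
        outputMode="append",
        timeoutConf="NoTimeout",
    )
```

The state schema (`last long`) stores only the epoch seconds of the user's latest timestamp, which is all that's needed to bridge micro-batches.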

My concerns would be:

  1. How many users is this streaming query going to deal with? (the more users, the bigger the state)

  2. When should the state be cleaned up, for users that are no longer expected in the stream? (that's what keeps the state at a reasonable size)
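For concern 2, GroupState supports timeouts, so the state of idle users can be dropped. A sketch of the cleanup pattern (illustrative; pair it with timeoutConf="ProcessingTimeTimeout" in applyInPandasWithState):

```python
def cleanup_on_timeout(key, pdfs, state):
    """Variant that drops a user's state when no data arrived within the
    timeout (pair with timeoutConf='ProcessingTimeTimeout')."""
    if state.hasTimedOut:
        state.remove()                       # bound state size: forget idle users
        return                               # nothing to emit for this key
    for pdf in pdfs:
        # ... compute the interval column and update state here ...
        state.setTimeoutDuration("1 hour")   # re-arm the timeout on new data
        yield pdf
```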
