pyspark: how to convert dataframes with a time column to a spark streaming object?

Question

Say I have a Spark DataFrame df with a column timestamp representing time in Unix-time format (seconds since 1970). How can I make Spark Streaming treat it as an input so that I can apply a sliding window to the data? Thanks!

Answer 1

You cannot, or at least not in a meaningful way. It is possible to use queueStream to create a stream from an RDD like this:

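from pyspark.streaming import StreamingContext

# Assumes an active SparkContext `sc`, as in the pyspark shell.
ssc = StreamingContext(sc, 10)  # 10-second batch interval
df = sc.parallelize([(i, ) for i in range(10000)]).toDF(["ts"])
stream = ssc.queueStream([df.rdd])  # each queued RDD becomes one batch
stream.count().pprint()

ssc.start()
ssc.awaitTermination()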

Here the correspondence between batches and objects in the queue is 1:1. Unfortunately, queueStream in PySpark is, unlike its Scala counterpart, a static stream: it is not possible to enqueue new data after the stream has been created. This means you have to split the DataFrame manually into multiple RDDs.
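As a rough sketch of that manual split, assuming the ts column holds Unix seconds and that one queued RDD should cover one batch interval of event time: the helper names bucket_bounds and split_by_time below are hypothetical, not part of PySpark, and the PySpark import is done inside the function so the bucketing logic can be read and checked on its own.

```python
def bucket_bounds(min_ts, max_ts, batch_seconds):
    """Return [start, end) intervals of width batch_seconds covering min_ts..max_ts.

    Intervals are aligned to multiples of batch_seconds. Empty intervals are
    kept so that gaps in the data still show up as empty batches.
    """
    start = (min_ts // batch_seconds) * batch_seconds
    bounds = []
    while start <= max_ts:
        bounds.append((start, start + batch_seconds))
        start += batch_seconds
    return bounds


def split_by_time(df, batch_seconds, ts_col="ts"):
    """Split a DataFrame into one RDD per time interval (hypothetical helper)."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark

    lo, hi = df.agg(F.min(ts_col), F.max(ts_col)).first()
    return [
        df.where((F.col(ts_col) >= start) & (F.col(ts_col) < end)).rdd
        for start, end in bucket_bounds(lo, hi, batch_seconds)
    ]


# Possible usage with the `ssc` and `df` from the example above:
#   rdds = split_by_time(df, 10)
#   stream = ssc.queueStream(rdds)           # one RDD per 10-second batch
#   stream.window(30, 10).count().pprint()   # 30-second window sliding by 10
```

Each filter scans the whole DataFrame, so caching df first is probably worthwhile when there are many intervals. Note also that the batches are replayed at the streaming batch interval, not according to the timestamps in the data.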
