
pyspark: how to convert dataframes with a time column to a spark streaming object?

Let's say I have a Spark DataFrame df with a column timestamp that holds time in Unix format (seconds since 1970). How do I make Spark Streaming treat this as an input so that I can do a sliding window over the data? Thanks!

You cannot, or at least not in a meaningful way. It is possible to use queueStream to create a stream from an RDD like this:

from pyspark.streaming import StreamingContext

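# Assumes an active SparkContext `sc` (for example, the pyspark shell)
ssc = StreamingContext(sc, 10)  # 10-second batch interval
df = sc.parallelize([(i, ) for i in range(10000)]).toDF(["ts"])
stream = ssc.queueStream([df.rdd])  # each queued RDD becomes one batch
stream.count().pprint()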

ssc.start()
ssc.awaitTermination()

Here each object in the queue corresponds to exactly one batch (1:1). Unfortunately, the PySpark queueStream is, unlike its Scala counterpart, a static stream: it is not possible to enqueue new data after the stream has been created. This means you have to split the DataFrame manually into multiple RDDs before creating the stream.
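
A minimal sketch of that manual split, assuming the timestamp column is named ts and holds integer Unix seconds. The helper names (bucket_bounds, split_by_time, windowed_counts) and the parameters bucket_seconds, batch_interval and window_batches are made up for illustration and are not part of PySpark. Note that each bucket is replayed as one processing-time batch, so the window is counted in batches rather than in real event time:

```python
def bucket_bounds(min_ts, max_ts, bucket_seconds):
    """Half-open [start, end) intervals covering min_ts..max_ts."""
    start = min_ts - (min_ts % bucket_seconds)  # align to a bucket boundary
    bounds = []
    while start <= max_ts:
        bounds.append((start, start + bucket_seconds))
        start += bucket_seconds
    return bounds


def split_by_time(df, ts_col, bucket_seconds):
    """Split a DataFrame into a list of RDDs, one per time bucket."""
    row = df.selectExpr(
        "min({0}) AS lo".format(ts_col),
        "max({0}) AS hi".format(ts_col),
    ).first()
    if row is None or row["lo"] is None:  # empty DataFrame
        return []
    return [
        df.filter((df[ts_col] >= lo) & (df[ts_col] < hi)).rdd
        for lo, hi in bucket_bounds(row["lo"], row["hi"], bucket_seconds)
    ]


def windowed_counts(ssc, df, batch_interval, ts_col="ts",
                    bucket_seconds=10, window_batches=3):
    """Replay the buckets as batches and count rows per sliding window.

    batch_interval must be the interval the StreamingContext was created
    with, since window and slide durations have to be multiples of it.
    """
    rdds = split_by_time(df, ts_col, bucket_seconds)
    stream = ssc.queueStream(rdds)  # one queued RDD is consumed per batch
    windowed = stream.window(window_batches * batch_interval, batch_interval)
    windowed.count().pprint()
    return windowed
```

With the ssc and df from the snippet above, you would call windowed_counts(ssc, df, batch_interval=10) and then ssc.start() and ssc.awaitTermination() as before. Every bucket triggers its own filter over df, so caching df first is probably worthwhile if there are many buckets.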


 