Let's say you have a Spark DataFrame df with a column timestamp representing time in Unix-time format (seconds since 1970). How do I make Spark Streaming treat this as an input so that I can apply a sliding window to the data? Thanks!
You cannot, or at least not in a meaningful way. It is possible to use queueStream to create a stream from an RDD like this:
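from pyspark.streaming import StreamingContext

# Assumes an existing SparkContext `sc`, as in the pyspark shell.
ssc = StreamingContext(sc, 10)  # 10-second batch interval
df = sc.parallelize([(i, ) for i in range(10000)]).toDF(["ts"])
stream = ssc.queueStream([df.rdd])  # one RDD in the queue, so one batch
stream.count().pprint()
ssc.start()
ssc.awaitTermination()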
Here each object in the queue corresponds to exactly one batch (1:1). Unfortunately, queueStream in PySpark is, unlike its Scala counterpart, a static stream: it is not possible to enqueue new data after it has been created. This means you have to split the DataFrame manually into multiple RDDs before creating the stream.
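A minimal sketch of that manual split, assuming the timestamp column holds integer Unix seconds and that one RDD per fixed-width time bucket is acceptable. The helper names bucket_bounds, split_by_time and build_windowed_stream are hypothetical and not part of Spark; only selectExpr, filter, rdd, queueStream and window are actual PySpark calls.

```python
def bucket_bounds(min_ts, max_ts, width):
    """Return half-open [start, end) intervals of `width` seconds
    that together cover every timestamp from min_ts to max_ts."""
    start = min_ts - (min_ts % width)  # align the first bucket to a multiple of width
    bounds = []
    while start <= max_ts:
        bounds.append((start, start + width))
        start += width
    return bounds


def split_by_time(df, ts_col, width):
    """Hypothetical helper: one RDD per time bucket of the DataFrame.
    Each bucket is a separate filter over df, so cache df if it is expensive."""
    lo, hi = df.selectExpr(f"min({ts_col})", f"max({ts_col})").first()
    return [
        df.filter((df[ts_col] >= start) & (df[ts_col] < end)).rdd
        for start, end in bucket_bounds(lo, hi, width)
    ]


def build_windowed_stream(ssc, df, ts_col="ts", width=10, window=30, slide=10):
    """Hypothetical helper: feed the buckets to queueStream, one per batch,
    and apply a sliding window. `window` and `slide` must be multiples of
    the batch interval of `ssc`."""
    rdds = split_by_time(df, ts_col, width)
    return ssc.queueStream(rdds).window(window, slide)
```

With the StreamingContext from the example above, build_windowed_stream(ssc, df).count().pprint() followed by ssc.start() would print one windowed count per batch. Note that the buckets are replayed at the wall-clock batch interval, so this only imitates event-time windows; empty buckets still take up a batch.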