JDBC source and Spark Structured Streaming

I've been using Spark Structured Streaming and am quite happy with it. I'm currently performing an ETL-type activity. I have a table in PostgreSQL that contains metadata-type information, which I'm looking to join with the streaming DataFrame.

metadataDf = spark \
    .read \
    .jdbc(url=jdbcUrl,
          table=query,
          properties=connectionProperties)

streamDF = spark \
    .readStream \
    .option("maxFilesPerTrigger",10) \
    .option("latestFirst",True) \
    .schema(sensorSchema) \
    .json(sensorPath)

joined_metadata = streamDF \
    .join(metadataDf,["uid"],"left")

write_query = joined_metadata \
    .writeStream \
    .trigger(processingTime=arbitarytime) \
    .format("json") \
    .option("checkpointLocation",chkploc) \
    .option("path",write_path) \
    .start()

The metadata table in PostgreSQL can get updated once every couple of days. I was wondering: do I need to accommodate refreshing the table in Spark with some kind of while loop, or does Spark's lazy evaluation take care of that particular scenario?

Thanks

Spark will take care of it as long as the program is running. If you don't specify a trigger interval, Spark will process this stream continuously (each batch starts once the last one has finished).

To specify a trigger interval, see df.trigger() here and in the docs.
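
For illustration, here is a minimal sketch of setting a processing-time trigger on the writer. The built-in "rate" source and the 30-second interval are just placeholders for this example, not part of the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-example").getOrCreate()

# The built-in "rate" source generates rows continuously; it stands in
# here for whatever streaming source you actually use (e.g. streamDF).
rate_df = spark \
    .readStream \
    .format("rate") \
    .option("rowsPerSecond", 1) \
    .load()

# processingTime takes a duration string. Omitting .trigger() entirely
# makes each micro-batch start as soon as the previous one finishes.
query = rate_df \
    .writeStream \
    .trigger(processingTime="30 seconds") \
    .format("console") \
    .start()

query.awaitTermination()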

:)
