JDBC source and Spark Structured Streaming

I've been using Spark Structured Streaming and am quite happy with it. I'm currently performing an ETL-type activity. I have a table in PostgreSQL that contains metadata-type information, which I'm looking to join with the streaming DataFrame.

metadataDf = spark \
    .read \
    .jdbc(url=jdbcUrl,
          table=query,
          properties=connectionProperties)

streamDF = spark \
    .readStream \
    .option("maxFilesPerTrigger",10) \
    .option("latestFirst",True) \
    .schema(sensorSchema) \
    .json(sensorPath)

joined_metadata = streamDF \
    .join(metadataDf,["uid"],"left")

write_query = joined_metadata \
    .writeStream \
    .trigger(processingTime=arbitarytime) \
    .format("json") \
    .option("checkpointLocation",chkploc) \
    .option("path",write_path) \
    .start()

The metadata table in PostgreSQL can get updated once every couple of days. I was wondering: do I need to accommodate refreshing the table in Spark with some kind of while loop, or does Spark's lazy evaluation take care of that particular scenario?

Thanks

Spark will take care of it as long as the program is running. If you don't specify a trigger interval, Spark will process this stream continuously (each batch starts once the last one has finished).

To specify a trigger interval, see df.trigger() here and in the docs.
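
For illustration, here is a minimal sketch of setting a processing-time trigger on the writer. The built-in "rate" source and the 30-second interval are just placeholders for this example, not part of the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-example").getOrCreate()

# The built-in "rate" source generates rows continuously; it stands in
# here for whatever streaming source you actually use (e.g. streamDF).
rate_df = spark \
    .readStream \
    .format("rate") \
    .option("rowsPerSecond", 1) \
    .load()

# processingTime takes a duration string. Omitting .trigger() entirely
# makes each micro-batch start as soon as the previous one finishes.
query = rate_df \
    .writeStream \
    .trigger(processingTime="30 seconds") \
    .format("console") \
    .start()

query.awaitTermination()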

:)
