
Delta Table to Spark Streaming to Synapse Table in Azure Databricks

I need to write and synchronize our merged Delta tables to an Azure data warehouse (Synapse). We are able to read the Delta table as a stream, but Spark Structured Streaming doesn't allow writing the stream directly to Synapse tables.

I then tried reading the Delta tables' Parquet files from a DBFS folder. We are able to do a batch read and insert/write the data into the Synapse DWH tables, but we are not able to keep the Delta tables synced with the Synapse tables.

How can we keep only the latest snapshot in the Synapse tables? We are not using a Synapse Analytics workspace and are trying to achieve this with a Spark streaming job in Databricks.

Any pointers will be helpful.

You could try to keep the data in sync by appending each micro-batch DataFrame inside a foreachBatch on your write stream. This method allows arbitrary ways of writing the data, so you can connect to the data warehouse over JDBC if necessary:

# Read the Delta table as a stream
df = spark.readStream \
          .format('delta') \
          .load(input_path)

# foreachBatch hands every micro-batch DataFrame to batch_process,
# so no sink format or output path is needed here
df_write = df.writeStream \
             .foreachBatch(batch_process) \
             .option("checkpointLocation", delta_chk_path) \
             .start()

With your batch function being something like:

def batch_process(df, batchId):
    # Transform the micro-batch as needed before loading
    # (transformAsNeeded was only a placeholder, not a real DataFrame method)
    # df = df.select(...).withColumn(...)

    # Append the micro-batch to the Synapse table over JDBC
    df.write.jdbc(jdbc_url, table=schema_name + "." + table_name, mode="append", properties=connection_properties)
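
For reference, jdbc_url and connection_properties are not defined above; a minimal sketch of what they might look like for a Synapse SQL endpoint is below. The server, database, user, and secret scope names are placeholders, not values from the question:

# Placeholder connection details -- replace with your own Synapse SQL endpoint
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-database>"
connection_properties = {
    "user": "<sql-user>",
    "password": dbutils.secrets.get(scope="<scope>", key="<key>"),  # e.g. from a Databricks secret scope
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}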

This of course assumes an incremental (append-only) insertion is appropriate for your problem. You could also keep a "final" table or temporary view in Databricks that holds the snapshot of the result data you want; once it is generated, truncate the target table in the data warehouse and overwrite it with that view (using JDBC again, as in the sketch below). This may be slow; if I remember correctly there is also a dedicated Synapse connector for Databricks. You could try a similar approach by writing the files directly into your storage account, in CSV, Parquet, or Delta, and then referencing them in Synapse with a dataset. You could also do a batch update, maybe with an integration pipeline in Data Factory, using a Databricks or notebook activity to do what you need.
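
For the snapshot option, a minimal sketch of the truncate-and-overwrite write could look like the following, assuming snapshot_df is the final DataFrame you build in Databricks and the same jdbc_url, schema_name, table_name, and placeholder credentials as above; the "truncate" option makes Spark TRUNCATE the existing warehouse table instead of dropping and recreating it when overwriting:

# Sketch: replace the warehouse table with the latest snapshot over JDBC
(snapshot_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", schema_name + "." + table_name)
    .option("user", "<sql-user>")          # placeholder credentials
    .option("password", "<password>")
    .option("truncate", "true")            # TRUNCATE instead of drop/recreate on overwrite
    .mode("overwrite")
    .save())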
