
How to add a column and a batch_Id value to a delta table using a running pyspark streaming job?

I'm trying to add a batch Id to each row in the current batch run and then write it to a delta table. A batch in my case is one CSV file with multiple values. I generate my batch Id value with a function. I can successfully add the correct batch Id when I set my streaming job to execute once, but when I set it to await termination it only executes my generate_id() function once and then adds that same value as the batch Id every time I upload a CSV file to my ADLS Gen2 container. I need it to execute my generate_id() function and get a new value every time it picks up a new CSV file. Please see my code below. I use a Synapse notebook to execute my code.

batch_id = 0 
def generate_id():
    global batch_id 
    batch_id = batch_id + 1 
    return batch_id

from pyspark.sql.functions import lit

stream = spark \
  .readStream \
  .option("maxFilesPerTrigger", 1) \
  .schema(customSchema) \
  .csv("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/csv_files/") \
  .withColumn("Batch_Id", lit(generate_id())) \
  .writeStream \
  .outputMode("append") \
  .format("delta") \
  .option("checkpointLocation", "abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/_checkpoints") \
  .option("mergeSchema", "true") \
  .foreachBatch(addCol) \
  .start() \
  .awaitTermination()

This is what I need:

File Number   Value   batch_Id
File1         Val1    1
File1         Val2    1
File1         Val3    1
File2         Val1    2
File3         Val1    3
File3         Val2    3

This is what I get at the moment:

File Number   Value   batch_Id
File1         Val1    1
File1         Val2    1
File1         Val3    1
File2         Val1    1
File3         Val1    1
File3         Val2    1

I've also tried to use the foreachBatch function, but that doesn't seem to work:

def addCol(df, epochId):
    df.withColumn("Batch_Id",lit(generate_id()))

stream = spark \
  .readStream \
  .option("maxFilesPerTrigger", 1) \
  .schema(customSchema) \
  .csv("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/csv_files/") \
  .writeStream \
  .outputMode("append") \
  .format("delta") \
  .option("checkpointLocation", "abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/_checkpoints") \
  .option("mergeSchema", "true") \
  .foreachBatch(addCol) \
  .toTable("patients") \
  .awaitTermination()

This is the error that I get when I run my code. I'm not sure what it means:

AnalysisException: The input source(foreachBatch) is different from the table patients's data source provider(delta).
Traceback (most recent call last):

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1563, in toTable
    return self._sq(self._jwrite.toTable(tableName))

  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None

pyspark.sql.utils.AnalysisException: The input source(foreachBatch) is different from the table patients's data source provider(delta).

I'm new to Spark streaming, but it feels like something like this should be possible while I keep my streaming job active. Any help would be appreciated.

Maybe you can try using the map() or mapPartitions() function to solve this use case. Something like the code below might work in your case.

You can call your generate-batch-Id function for each row object in your dataframe.

df.mapPartitions(iterator => {
    // assumes an implicit Encoder[Row] (e.g. RowEncoder(df.schema)) is in scope
    val resultList = scala.collection.mutable.ListBuffer[Row]()
    iterator.foreach(rowObject => {
        val batchId = generateBatchId()
        val fileNumber = rowObject.getAs[String]("fileNumber")
        val value = rowObject.getAs[String]("value")
        resultList += Row(fileNumber, value, batchId)
    })
    resultList.iterator
})
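
Since the question is written in PySpark, a minimal sketch of the same idea in Python might look like the following. It assumes the row tagging happens inside a foreachBatch function (RDD operations such as mapPartitions aren't available on a streaming DataFrame directly), reuses generate_id() and the storage account from the question, and the names add_batch_id, tag_partition and the output path are only illustrative.

from pyspark.sql.types import LongType, StructField, StructType

def add_batch_id(micro_df, epoch_id):
    # generate_id() is called once per micro-batch, i.e. once per CSV file
    current_id = generate_id()

    def tag_partition(rows):
        # append the batch id to every row of this partition
        for row in rows:
            yield tuple(row) + (current_id,)

    out_schema = StructType(micro_df.schema.fields + [StructField("Batch_Id", LongType())])
    out_df = spark.createDataFrame(micro_df.rdd.mapPartitions(tag_partition), out_schema)
    out_df.write.format("delta").mode("append") \
        .save("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/output")

It would then be attached with .writeStream.foreachBatch(add_batch_id).option("checkpointLocation", ...).start() instead of the withColumn call on the stream itself.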
    
Another option is to add the column inside a foreachBatch function and merge each micro-batch into the Delta table, so the batch Id is generated anew for every file:

from delta.tables import DeltaTable
from pyspark.sql.functions import lit

deltadf = DeltaTable.forName(spark, table_name)

def mergeToDF(microBatchDf, batchid):
    # increment_id() is assumed to behave like generate_id() above, returning the
    # next batch number; test_id is assumed to be defined elsewhere in the notebook
    microBatchDf = microBatchDf.withColumn("Batch_ID", lit(increment_id())) \
                               .withColumn("Test_ID", lit(test_id))
    (deltadf.alias("target").merge(
        source=microBatchDf.alias("source"),
        condition="source.RECID = target.RECID")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

spark.readStream.format("csv") \
  .option("maxFilesPerTrigger", 1) \
  .schema(table_schema) \
  .load(f"data_lake_path/{table_name}") \
  .writeStream.format("delta") \
  .outputMode("append") \
  .foreachBatch(mergeToDF) \
  .option("mergeSchema",True) \
  .option("checkpointLocation","data_lake_path/_checkpoints") \
  .start(save_path)
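
One more note on this pattern: foreachBatch already passes a per-trigger batch id as the second argument (batchid above), which increases with every micro-batch and is tracked through the checkpoint. If the numbering doesn't have to start at exactly 1, that value can be used directly instead of maintaining a global counter. A minimal sketch, assuming the deltadf table and RECID merge key from the code above; the function name mergeWithEpochId is only illustrative:

from pyspark.sql.functions import lit

def mergeWithEpochId(microBatchDf, batchid):
    # batchid is supplied by Structured Streaming and increases with every
    # micro-batch, i.e. with every CSV file when maxFilesPerTrigger is 1
    microBatchDf = microBatchDf.withColumn("Batch_ID", lit(batchid))
    (deltadf.alias("target").merge(
        source=microBatchDf.alias("source"),
        condition="source.RECID = target.RECID")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

The readStream/writeStream wiring stays the same, with .foreachBatch(mergeWithEpochId).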
