
How to add a column and a batch_Id value to a delta table using a running pyspark streaming job?

I'm trying to add a batch Id to each row in the current batch run and then write it to a delta table. A batch in my case is one CSV file with multiple values. I generate my batch Id value with a function. I can successfully add the correct batch Id when I set my streaming job to execute once, but when I set it to await termination it only executes my generate_id() function once and then adds that same value as the batch Id every time I upload a CSV file to my ADLS Gen2 container. I need it to execute my generate_id() function and get a new value every time it picks up a new CSV file. Please see my code below. I use a Synapse notebook to execute my code.

batch_id = 0 
def generate_id():
    global batch_id 
    batch_id = batch_id + 1 
    return batch_id

from pyspark.sql.functions import lit

stream = spark \
  .readStream \
  .option("maxFilesPerTrigger", 1) \
  .schema(customSchema) \
  .csv("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/csv_files/") \
  .withColumn("Batch_Id", lit(generate_id())) \
  .writeStream \
  .outputMode("append") \
  .format("delta") \
  .option("checkpointLocation", "abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/_checkpoints") \
  .option("mergeSchema", "true") \
  .foreachBatch(addCol) \
  .start() \
  .awaitTermination()

This is what I need:

File Number   Value   batch_Id
File1         Val1    1
File1         Val2    1
File1         Val3    1
File2         Val1    2
File3         Val1    3
File3         Val2    3

This is what I get at the moment:

File Number   Value   batch_Id
File1         Val1    1
File1         Val2    1
File1         Val3    1
File2         Val1    1
File3         Val1    1
File3         Val2    1

I've also tried to use the foreachBatch function, but that doesn't seem to work:

def addCol(df, epochId):
    df.withColumn("Batch_Id",lit(generate_id()))

stream = spark \
  .readStream \
  .option("maxFilesPerTrigger", 1) \
  .schema(customSchema) \
  .csv("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/csv_files/") \
  .writeStream \
  .outputMode("append") \
  .format("delta") \
  .option("checkpointLocation", "abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/_checkpoints") \
  .option("mergeSchema", "true") \
  .foreachBatch(addCol) \
  .toTable("patients") \
  .awaitTermination()

This is the error that I get when I run my code. I'm not sure what it means:

AnalysisException: The input source(foreachBatch) is different from the table patients's data source provider(delta).
Traceback (most recent call last):

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1563, in toTable
    return self._sq(self._jwrite.toTable(tableName))

  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None

pyspark.sql.utils.AnalysisException: The input source(foreachBatch) is different from the table patients's data source provider(delta).

I'm new to Spark streaming, but it feels like something like this should be possible while I keep my streaming job active. Any help would be appreciated.

Maybe you can try using the map() or mapPartitions() function to solve this use case. Something like the code below might work in your case.

You can call your generate-batch-Id function for each row object in your dataframe.

df.mapPartitions(iterator => {
    // assumes an implicit Encoder[Row] (e.g. RowEncoder(df.schema)) is in scope
    val resultList = scala.collection.mutable.ListBuffer[Row]()
    iterator.foreach(rowObject => {
        val batchId = generateBatchId()
        val fileNumber = rowObject.getAs[String]("fileNumber")
        val value = rowObject.getAs[String]("value")
        resultList += Row(fileNumber, value, batchId)
    })
    resultList.iterator
})
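
Since the question is written in PySpark, a minimal sketch of the same idea in Python might look like the following. It assumes the row tagging happens inside a foreachBatch function (RDD operations such as mapPartitions aren't available on a streaming DataFrame directly), reuses generate_id() and the storage account from the question, and the names add_batch_id, tag_partition and the output path are only illustrative.

from pyspark.sql.types import LongType, StructField, StructType

def add_batch_id(micro_df, epoch_id):
    # generate_id() is called once per micro-batch, i.e. once per CSV file
    current_id = generate_id()

    def tag_partition(rows):
        # append the batch id to every row of this partition
        for row in rows:
            yield tuple(row) + (current_id,)

    out_schema = StructType(micro_df.schema.fields + [StructField("Batch_Id", LongType())])
    out_df = spark.createDataFrame(micro_df.rdd.mapPartitions(tag_partition), out_schema)
    out_df.write.format("delta").mode("append") \
        .save("abfss://synapse@{storageAccountName}.dfs.core.windows.net/delta/putty/streaming_test/output")

It would then be attached with .writeStream.foreachBatch(add_batch_id).option("checkpointLocation", ...).start() instead of the withColumn call on the stream itself.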
    
Another option is to add the column inside a foreachBatch function and merge each micro-batch into the Delta table, so the batch Id is generated anew for every file:

from delta.tables import DeltaTable
from pyspark.sql.functions import lit

deltadf = DeltaTable.forName(spark, table_name)

def mergeToDF(microBatchDf, batchid):
    # increment_id() is assumed to behave like generate_id() above, returning the
    # next batch number; test_id is assumed to be defined elsewhere in the notebook
    microBatchDf = microBatchDf.withColumn("Batch_ID", lit(increment_id())) \
                               .withColumn("Test_ID", lit(test_id))
    (deltadf.alias("target").merge(
        source=microBatchDf.alias("source"),
        condition="source.RECID = target.RECID")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

spark.readStream.format("csv") \
  .option("maxFilesPerTrigger", 1) \
  .schema(table_schema) \
  .load(f"data_lake_path/{table_name}") \
  .writeStream.format("delta") \
  .outputMode("append") \
  .foreachBatch(mergeToDF) \
  .option("mergeSchema",True) \
  .option("checkpointLocation","data_lake_path/_checkpoints") \
  .start(save_path)
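
One more note on this pattern: foreachBatch already passes a per-trigger batch id as the second argument (batchid above), which increases with every micro-batch and is tracked through the checkpoint. If the numbering doesn't have to start at exactly 1, that value can be used directly instead of maintaining a global counter. A minimal sketch, assuming the deltadf table and RECID merge key from the code above; the function name mergeWithEpochId is only illustrative:

from pyspark.sql.functions import lit

def mergeWithEpochId(microBatchDf, batchid):
    # batchid is supplied by Structured Streaming and increases with every
    # micro-batch, i.e. with every CSV file when maxFilesPerTrigger is 1
    microBatchDf = microBatchDf.withColumn("Batch_ID", lit(batchid))
    (deltadf.alias("target").merge(
        source=microBatchDf.alias("source"),
        condition="source.RECID = target.RECID")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

The readStream/writeStream wiring stays the same, with .foreachBatch(mergeWithEpochId).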
