Upsert in Databricks using PySpark

I am trying to create a DataFrame, store it as a Delta table, and then perform an upsert. I found this function online and modified it to suit the path I am trying to use.

delta_store='s3://raw_data/ETL_test/Delta/'

The DataFrame I create:

from pyspark.sql import Row

Employee = Row("id", "FirstName", "LastName", "Email")
employee1 = Employee('1', 'Basher', 'armbrust', 'bash@gmail.com')
employee2 = Employee('2', 'Daniel', 'meng', 'daniel@stanford.edu')
employee3 = Employee('3', 'Muriel', None, 'muriel@waterloo.edu')
employee4 = Employee('4', 'Rachel', 'wendell', 'rach_3@imaginea.com')
employee5 = Employee('5', 'Zach', 'galifianakis', 'zach_g@pramati.co')
employee6 = Employee('6', 'Ramesh', 'Babu', 'ramesh@pramati.co')
employee7 = Employee('7', 'Bipul', 'Kumar', 'bipul@pramati.co')
employee8 = Employee('8', 'Sampath', 'Kumar', 'sam@pramati.co')
employee9 = Employee('9', 'Anil', 'Reddy', 'anil@pramati.co')
employee10 = Employee('10', 'Mageswaran', 'Dhandapani', 'mageswaran@pramati.co')

compacted_df = spark.createDataFrame([employee1, employee2, employee3, employee4, employee5, employee6, employee7, employee8, employee9, employee10])

display(compacted_df)

The upsert function:

import os
from delta.tables import DeltaTable

def upsert(df, path=DELTA_STORE, is_delete=False):
  """
  Stores the DataFrame as a Delta table if the path is empty, or tries to merge the data if one is found.
  df : DataFrame
  path : Delta table store path
  is_delete : delete the path directory first
  """
  if is_delete:
    dbutils.fs.rm(path, True)
  if os.path.exists(path):
    print("Modifying existing table...")
    delta_table = DeltaTable.forPath(spark, delta_store)
    match_expr = "delta.{} = updates.{}".format("id", "id")  and "delta.{} = updates.{}".format("FirstName", "FirstName")
    delta_table.alias("delta").merge(
              df.alias("updates"), match_expr) \
              .whenMatchedUpdateAll() \
              .whenNotMatchedInsertAll() \
              .execute()

  else:
    print("Creating new Delta table")
    df.write.format("delta").save(delta_store)

I then run the following code to modify the data, and run into the following error:

employee14 = Employee('2', 'Daniel', 'Dang', 'ddang@stanford.edu')
employee15 = Employee('15', 'Anitha', 'Ramasamy', 'anitha@pramati.co')
ingestion_updates_df =  spark.createDataFrame([employee14, employee15])
upsert(df=ingestion_updates_df, is_delete=False) 

Error:

AnalysisException: s3://raw_data/ETL_test/Delta already exists.

Can somebody explain what I am doing wrong here?

This is likely just a Python-versus-S3 logic error.

The call os.path.exists(path) almost certainly always returns False, because os.path only understands POSIX filesystems, not S3 blob store paths.
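
For illustration, one Delta-aware alternative to os.path.exists is DeltaTable.isDeltaTable, which goes through Spark's filesystem layer and therefore understands s3:// URIs. A minimal sketch, assuming the delta Python package is importable and a `spark` session is in scope (both are true on Databricks runtimes):

import os
from delta.tables import DeltaTable

path = "s3://raw_data/ETL_test/Delta/"

# os.path.exists only sees the driver's local POSIX filesystem,
# so it returns False even when the Delta table exists on S3:
print(os.path.exists(path))                   # False

# DeltaTable.isDeltaTable resolves the path via Spark,
# so it returns True once a Delta table exists at the S3 path:
print(DeltaTable.isDeltaTable(spark, path))   # True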

On the second pass into your function, the code therefore goes down the else branch and tries to save to the same path again without a .mode("overwrite") option, which is what raises the "already exists" error.
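
Putting that together, a sketch of a corrected function (untested against your bucket) could use DeltaTable.isDeltaTable for the existence check and use the path argument consistently. In passing it also combines the two merge conditions with SQL AND: in the original, the Python `and` between two non-empty strings simply evaluates to the second string, so only the FirstName equality was actually used.

import os
from delta.tables import DeltaTable

DELTA_STORE = "s3://raw_data/ETL_test/Delta/"

def upsert(df, path=DELTA_STORE, is_delete=False):
    """
    Stores the DataFrame as a Delta table if none exists at `path`,
    otherwise merges the data into the existing table.
    """
    if is_delete:
        dbutils.fs.rm(path, True)
    # Unlike os.path.exists, this understands s3:// paths:
    if DeltaTable.isDeltaTable(spark, path):
        print("Modifying existing table...")
        delta_table = DeltaTable.forPath(spark, path)
        # Combine both equality conditions with SQL AND, not Python `and`
        # (Python `and` between two strings just returns the second string):
        match_expr = "delta.id = updates.id AND delta.FirstName = updates.FirstName"
        (delta_table.alias("delta")
            .merge(df.alias("updates"), match_expr)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(path)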
