Upsert in Databricks using PySpark

I am trying to create a DataFrame, store it as a Delta table, and then perform an upsert. I found this function online and modified it to suit the path I am trying to use.

delta_store='s3://raw_data/ETL_test/Delta/'

The DataFrame I create:

from pyspark.sql import Row

Employee = Row("id", "FirstName", "LastName", "Email")
employee1 = Employee('1', 'Basher', 'armbrust', 'bash@gmail.com')
employee2 = Employee('2', 'Daniel', 'meng', 'daniel@stanford.edu')
employee3 = Employee('3', 'Muriel', None, 'muriel@waterloo.edu')
employee4 = Employee('4', 'Rachel', 'wendell', 'rach_3@imaginea.com')
employee5 = Employee('5', 'Zach', 'galifianakis', 'zach_g@pramati.co')
employee6 = Employee('6', 'Ramesh', 'Babu', 'ramesh@pramati.co')
employee7 = Employee('7', 'Bipul', 'Kumar', 'bipul@pramati.co')
employee8 = Employee('8', 'Sampath', 'Kumar', 'sam@pramati.co')
employee9 = Employee('9', 'Anil', 'Reddy', 'anil@pramati.co')
employee10 = Employee('10', 'Mageswaran', 'Dhandapani', 'mageswaran@pramati.co')

compacted_df = spark.createDataFrame([employee1, employee2, employee3, employee4, employee5, employee6, employee7, employee8, employee9, employee10])

display(compacted_df)

The upsert function:

import os
from delta.tables import DeltaTable

def upsert(df, path=DELTA_STORE, is_delete=False):
  """
  Stores the DataFrame as a Delta table if the path is empty, or tries to merge the data if one is found.
  df : DataFrame
  path : Delta table store path
  is_delete : delete the path directory first
  """
  if is_delete:
    dbutils.fs.rm(path, True)
  if os.path.exists(path):
    print("Modifying existing table...")
    delta_table = DeltaTable.forPath(spark, delta_store)
    match_expr = "delta.{} = updates.{}".format("id", "id")  and "delta.{} = updates.{}".format("FirstName", "FirstName")
    delta_table.alias("delta").merge(
              df.alias("updates"), match_expr) \
              .whenMatchedUpdateAll() \
              .whenNotMatchedInsertAll() \
              .execute()

  else:
    print("Creating new Delta table")
    df.write.format("delta").save(delta_store)

I then run the following code to modify the data, and run into the following error:

employee14 = Employee('2', 'Daniel', 'Dang', 'ddang@stanford.edu')
employee15 = Employee('15', 'Anitha', 'Ramasamy', 'anitha@pramati.co')
ingestion_updates_df =  spark.createDataFrame([employee14, employee15])
upsert(df=ingestion_updates_df, is_delete=False) 

Error:

AnalysisException: s3://raw_data/ETL_test/Delta already exists.

Can somebody explain what I am doing wrong here?

This is likely just a Python-versus-S3 logic error.

The call os.path.exists(path) almost certainly always returns False, because os.path only understands POSIX filesystems, not S3 blob store paths.
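
For illustration, one Delta-aware alternative to os.path.exists is DeltaTable.isDeltaTable, which goes through Spark's filesystem layer and therefore understands s3:// URIs. A minimal sketch, assuming the delta Python package is importable and a `spark` session is in scope (both are true on Databricks runtimes):

import os
from delta.tables import DeltaTable

path = "s3://raw_data/ETL_test/Delta/"

# os.path.exists only sees the driver's local POSIX filesystem,
# so it returns False even when the Delta table exists on S3:
print(os.path.exists(path))                   # False

# DeltaTable.isDeltaTable resolves the path via Spark,
# so it returns True once a Delta table exists at the S3 path:
print(DeltaTable.isDeltaTable(spark, path))   # True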

On the second pass into your function, the code therefore goes down the else branch and tries to save to the same path again without a .mode("overwrite") option, which is what raises the "already exists" error.
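
Putting that together, a sketch of a corrected function (untested against your bucket) could use DeltaTable.isDeltaTable for the existence check and use the path argument consistently. In passing it also combines the two merge conditions with SQL AND: in the original, the Python `and` between two non-empty strings simply evaluates to the second string, so only the FirstName equality was actually used.

import os
from delta.tables import DeltaTable

DELTA_STORE = "s3://raw_data/ETL_test/Delta/"

def upsert(df, path=DELTA_STORE, is_delete=False):
    """
    Stores the DataFrame as a Delta table if none exists at `path`,
    otherwise merges the data into the existing table.
    """
    if is_delete:
        dbutils.fs.rm(path, True)
    # Unlike os.path.exists, this understands s3:// paths:
    if DeltaTable.isDeltaTable(spark, path):
        print("Modifying existing table...")
        delta_table = DeltaTable.forPath(spark, path)
        # Combine both equality conditions with SQL AND, not Python `and`
        # (Python `and` between two strings just returns the second string):
        match_expr = "delta.id = updates.id AND delta.FirstName = updates.FirstName"
        (delta_table.alias("delta")
            .merge(df.alias("updates"), match_expr)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(path)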
