
Upsert in databricks using pyspark

I am trying to create a df, store it as a Delta table, and then perform an upsert. I found this function online and only modified it to suit the path I am trying to use.

delta_store='s3://raw_data/ETL_test/Delta/'

The df I create:

from pyspark.sql import Row

Employee = Row("id", "FirstName", "LastName", "Email")
employee1 = Employee('1', 'Basher', 'armbrust', 'bash@gmail.com')
employee2 = Employee('2', 'Daniel', 'meng', 'daniel@stanford.edu')
employee3 = Employee('3', 'Muriel', None, 'muriel@waterloo.edu')
employee4 = Employee('4', 'Rachel', 'wendell', 'rach_3@imaginea.com')
employee5 = Employee('5', 'Zach', 'galifianakis', 'zach_g@pramati.co')
employee6 = Employee('6', 'Ramesh', 'Babu', 'ramesh@pramati.co')
employee7 = Employee('7', 'Bipul', 'Kumar', 'bipul@pramati.co')
employee8 = Employee('8', 'Sampath', 'Kumar', 'sam@pramati.co')
employee9 = Employee('9', 'Anil', 'Reddy', 'anil@pramati.co')
employee10 = Employee('10', 'Mageswaran', 'Dhandapani', 'mageswaran@pramati.co')

compacted_df = spark.createDataFrame([employee1, employee2, employee3, employee4, employee5, employee6, employee7, employee8, employee9, employee10])

display(compacted_df)

The upsert function:

def upsert(df, path=DELTA_STORE, is_delete=False):
  """
  Stores the Dataframe as Delta table if the path is empty or tries to merge the data if found
  df : Dataframe 
  path : Delta table store path
  is_delete: Delete the path directory
  """
  if is_delete:
    dbutils.fs.rm(path, True)
  if os.path.exists(path):
    print("Modifying existing table...")
    delta_table = DeltaTable.forPath(spark,delta_store)
    match_expr = "delta.{} = updates.{}".format("id", "id")  and "delta.{} = updates.{}".format("FirstName", "FirstName")
    delta_table.alias("delta").merge(
              df.alias("updates"), match_expr) \
              .whenMatchedUpdateAll() \
              .whenNotMatchedInsertAll() \
              .execute()

  else:
    print("Creating new Delta table")
    df.write.format("delta").save(delta_store)

I then run the following code to modify the data and run into the following error:

employee14 = Employee('2', 'Daniel', 'Dang', 'ddang@stanford.edu')
employee15 = Employee('15', 'Anitha', 'Ramasamy', 'anitha@pramati.co')
ingestion_updates_df =  spark.createDataFrame([employee14, employee15])
upsert(df=ingestion_updates_df, is_delete=False) 

Error:

AnalysisException: s3://raw_data/ETL_test/Delta already exists.

Can somebody explain what I am doing wrong here?

This might just be a Python / S3 logic error.

The os.path.exists(path) call probably always returns False, because it only understands POSIX filesystem paths, not S3 blob store paths.
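
As a quick sanity check (a hypothetical snippet, assuming the delta-spark package is available on the cluster), you can compare the plain os check with Delta's own path-aware check:

import os
from delta.tables import DeltaTable

path = "s3://raw_data/ETL_test/Delta/"
# os only looks at the local POSIX filesystem, so an S3 URI is never "found"
print(os.path.exists(path))                  # False even after the table is written
# DeltaTable.isDeltaTable goes through the Spark/Hadoop filesystem layer instead
print(DeltaTable.isDeltaTable(spark, path))  # True once a Delta table exists at that S3 path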

On the second pass into your function, your code goes down the else branch and ends up trying to save to the same path again without using a .mode("overwrite") option.
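
A minimal corrected sketch along those lines (my own assumptions, not the original code: the merge is keyed on id only, and DeltaTable.isDeltaTable replaces os.path.exists; spark and dbutils are the usual Databricks notebook globals):

from delta.tables import DeltaTable

def upsert(df, path=delta_store, is_delete=False):
    """Merge df into the Delta table at path, creating the table on the first run."""
    if is_delete:
        dbutils.fs.rm(path, True)
    if DeltaTable.isDeltaTable(spark, path):  # Delta-aware existence check, works on S3
        print("Modifying existing table...")
        delta_table = DeltaTable.forPath(spark, path)
        (delta_table.alias("delta")
            .merge(df.alias("updates"), "delta.id = updates.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    else:
        print("Creating new Delta table")
        # overwrite guards against stray files already sitting at the same path
        df.write.format("delta").mode("overwrite").save(path)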
