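Upsert in Databricks using PySpark
I am trying to create a DataFrame, store it as a Delta table, and then perform an upsert. I found this function online and only modified it to fit the path I am trying to use.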
delta_store='s3://raw_data/ETL_test/Delta/'
The DataFrame I created:
from pyspark.sql import Row

Employee = Row("id", "FirstName", "LastName", "Email")
employee1 = Employee('1', 'Basher', 'armbrust', 'bash@gmail.com')
employee2 = Employee('2', 'Daniel', 'meng', 'daniel@stanford.edu')
employee3 = Employee('3', 'Muriel', None, 'muriel@waterloo.edu')
employee4 = Employee('4', 'Rachel', 'wendell', 'rach_3@imaginea.com')
employee5 = Employee('5', 'Zach', 'galifianakis', 'zach_g@pramati.co')
employee6 = Employee('6', 'Ramesh', 'Babu', 'ramesh@pramati.co')
employee7 = Employee('7', 'Bipul', 'Kumar', 'bipul@pramati.co')
employee8 = Employee('8', 'Sampath', 'Kumar', 'sam@pramati.co')
employee9 = Employee('9', 'Anil', 'Reddy', 'anil@pramati.co')
employee10 = Employee('10', 'Mageswaran', 'Dhandapani', 'mageswaran@pramati.co')
compacted_df = spark.createDataFrame([employee1, employee2, employee3, employee4, employee5, employee6, employee7, employee8, employee9, employee10])
display(compacted_df)
The upsert function:
import os
from delta.tables import DeltaTable

def upsert(df, path=delta_store, is_delete=False):
    """
    Stores the DataFrame as a Delta table if the path is empty, or tries to merge the data if a table is found
    df : DataFrame
    path : Delta table store path
    is_delete: Delete the path directory
    """
    if is_delete:
        dbutils.fs.rm(path, True)
    if os.path.exists(path):
        print("Modifying existing table...")
        delta_table = DeltaTable.forPath(spark, delta_store)
        match_expr = "delta.{} = updates.{}".format("id", "id") and "delta.{} = updates.{}".format("FirstName", "FirstName")
        delta_table.alias("delta").merge(
            df.alias("updates"), match_expr) \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(delta_store)
Then I ran the following code to modify the data and got the error shown below:
employee14 = Employee('2', 'Daniel', 'Dang', 'ddang@stanford.edu')
employee15 = Employee('15', 'Anitha', 'Ramasamy', 'anitha@pramati.co')
ingestion_updates_df = spark.createDataFrame([employee14, employee15])
upsert(df=ingestion_updates_df, is_delete=False)
Error:
AnalysisException: s3://raw_data/ETL_test/Delta already exists.
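Can someone explain what I am doing wrong here?
This is probably just a Python/S3 logic error. os.path.exists(path) will likely always return False, because it only understands POSIX file systems, not S3 blob storage paths.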
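A quick way to see this is to call os.path.exists on the S3 URI directly. The snippet below is only a small illustration using the path from the question; it assumes there is no local directory literally named "s3:" in the working directory, which is what the string would resolve to on a local file system.

```python
import os

# The Delta path from the question. os.path knows nothing about the s3:// scheme,
# so it treats this string as a relative path on the local file system.
s3_path = "s3://raw_data/ETL_test/Delta/"

# The check never contacts S3; it only looks at the driver's local disk.
exists_locally = os.path.exists(s3_path)
print(exists_locally)
```

Whatever is stored in the bucket, the result reflects only the local disk, so the function in the question never takes its merge branch.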
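The second time your function is called, your code takes the else branch and ends up trying to save to the same path (again) without the .mode("OVERWRITE") option, which is why Spark reports that the path already exists.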
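One possible way to restructure the function is sketched below. It swaps os.path.exists for DeltaTable.isDeltaTable, which asks Spark whether a Delta table lives at the path and therefore works for S3 locations. The helper build_match_expr and the keys parameter are hypothetical names introduced for illustration; they are not part of the original function. The sketch also joins the key comparisons with SQL AND, because in the question the Python expression "..." and "..." evaluates to the second string only, so the merge condition there compares FirstName alone.

```python
def build_match_expr(keys, target="delta", source="updates"):
    """Hypothetical helper: one equality per key column, joined with SQL AND."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert(spark, df, path, keys=("id",)):
    """Merge df into the Delta table at path, or create the table if none exists.

    Assumption: `keys` lists the columns that identify a row (here just "id").
    """
    # Imported lazily so build_match_expr can be used without delta-spark installed.
    from delta.tables import DeltaTable

    # isDeltaTable goes through Spark's file system layer, so s3:// paths are understood.
    if DeltaTable.isDeltaTable(spark, path):
        print("Modifying existing table...")
        (
            DeltaTable.forPath(spark, path)
            .alias("delta")
            .merge(df.alias("updates"), build_match_expr(keys))
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(path)
```

With the data from the question, a call such as upsert(spark, ingestion_updates_df, delta_store, keys=["id"]) would be expected to update the row with id 2 and insert the row with id 15, assuming the table was created by an earlier call.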