Upsert in databricks using pyspark
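I am trying to create a df, store it as a Delta table, and then perform an upsert on it. I found this function online and only modified it to fit the path I am trying to use.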
delta_store='s3://raw_data/ETL_test/Delta/'
The df I created:
from pyspark.sql import Row

Employee = Row("id", "FirstName", "LastName", "Email")
employee1 = Employee('1', 'Basher', 'armbrust', 'bash@gmail.com')
employee2 = Employee('2', 'Daniel', 'meng', 'daniel@stanford.edu')
employee3 = Employee('3', 'Muriel', None, 'muriel@waterloo.edu')
employee4 = Employee('4', 'Rachel', 'wendell', 'rach_3@imaginea.com')
employee5 = Employee('5', 'Zach', 'galifianakis', 'zach_g@pramati.co')
employee6 = Employee('6', 'Ramesh', 'Babu', 'ramesh@pramati.co')
employee7 = Employee('7', 'Bipul', 'Kumar', 'bipul@pramati.co')
employee8 = Employee('8', 'Sampath', 'Kumar', 'sam@pramati.co')
employee9 = Employee('9', 'Anil', 'Reddy', 'anil@pramati.co')
employee10 = Employee('10', 'Mageswaran', 'Dhandapani', 'mageswaran@pramati.co')
compacted_df = spark.createDataFrame([employee1, employee2, employee3, employee4, employee5, employee6, employee7, employee8, employee9, employee10])
display(compacted_df)
The upsert function:
import os
from delta.tables import DeltaTable

def upsert(df, path=delta_store, is_delete=False):
    """
    Stores the Dataframe as Delta table if the path is empty or tries to merge the data if found
    df : Dataframe
    path : Delta table store path
    is_delete: Delete the path directory
    """
    if is_delete:
        dbutils.fs.rm(path, True)
    if os.path.exists(path):
        print("Modifying existing table...")
        delta_table = DeltaTable.forPath(spark,delta_store)
        match_expr = "delta.{} = updates.{}".format("id", "id") and "delta.{} = updates.{}".format("FirstName", "FirstName")
        delta_table.alias("delta").merge(
            df.alias("updates"), match_expr) \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(delta_store)
I then ran the following code to modify the data and ran into the error below:
employee14 = Employee('2', 'Daniel', 'Dang', 'ddang@stanford.edu')
employee15 = Employee('15', 'Anitha', 'Ramasamy', 'anitha@pramati.co')
ingestion_updates_df = spark.createDataFrame([employee14, employee15])
upsert(df=ingestion_updates_df, is_delete=False)
Error:
AnalysisException: s3://raw_data/ETL_test/Delta already exists.
Can someone explain what I am doing wrong here?
This is probably just a Python/S3 logic error. The call os.path.exists(path) likely always returns False, because it only understands POSIX filesystem paths, not S3 blob storage paths.
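The behaviour of os.path.exists described here can be checked without Spark. The snippet below is a minimal sketch; local_path_exists is a hypothetical wrapper introduced only for illustration, and the S3 URI is the one from the question.

```python
import os
import tempfile


def local_path_exists(path):
    """Hypothetical wrapper: asks the local filesystem whether `path` exists."""
    return os.path.exists(path)


# A real local directory is found by os.path.exists.
with tempfile.TemporaryDirectory() as tmp_dir:
    print(local_path_exists(tmp_dir))

# An S3 URI is not resolved against the bucket. It is looked up as a
# relative local path, which normally does not exist, whatever the
# bucket actually contains.
print(local_path_exists("s3://raw_data/ETL_test/Delta/"))
```

Because the check never reaches S3, the result says nothing about whether a Delta table is already stored at that location.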
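On the second call to your function, the code therefore takes the else branch and ends up trying to save to the same path (again) without the .mode("overwrite") option, which raises the "already exists" error.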
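One possible way to restructure the function is sketched below. It assumes the delta-spark package is available and swaps the os.path.exists check for DeltaTable.isDeltaTable, which asks Spark whether a Delta table exists at the given path. The helper build_match_expr, the keys parameter, and passing spark in explicitly are choices made for this illustration, not part of the original code. The helper also joins the conditions with SQL AND: in Python, "a" and "b" evaluates to "b", so the match_expr in the question ends up comparing only FirstName.

```python
def build_match_expr(keys, target="delta", source="updates"):
    """Build a SQL merge condition, joining column checks with SQL AND."""
    return " AND ".join(f"{target}.{key} = {source}.{key}" for key in keys)


def upsert(spark, df, path, keys=("id",)):
    """Merge df into the Delta table at path, or create the table if absent."""
    # Imported here so build_match_expr stays usable without delta-spark.
    from delta.tables import DeltaTable

    if DeltaTable.isDeltaTable(spark, path):
        print("Modifying existing table...")
        (
            DeltaTable.forPath(spark, path)
            .alias("delta")
            .merge(df.alias("updates"), build_match_expr(keys))
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        print("Creating new Delta table")
        df.write.format("delta").save(path)
```

With this sketch, the second call from the question would be written as upsert(spark, ingestion_updates_df, delta_store). Adding .mode("overwrite") to the write would also make the error go away, but it would replace the stored rows instead of merging them, so it is not an upsert.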