
Inserting Records To Delta Table Through Databricks

I want to insert 100,000 records into a Delta table using Databricks. I am trying to insert the data with a simple for loop, something like this:

revision_date = '01/04/2022'
for i in range(0, 100000):
    spark.sql(f"""insert into db.delta_table_name values ('Class1', date_add(to_date('{revision_date}', 'dd/MM/yyyy'), {i}))""")

The problem is that inserting data with INSERT statements in Databricks takes awfully long; it took more than 5 hours to complete. Can anyone suggest an alternative or a solution to this problem in Databricks?

My cluster configuration is 168 GB, 24 cores, DBR 9.1 LTS, Spark 3.1.2.

Looping over an enormous number of INSERT operations on a Delta table is expensive because every single INSERT command creates a new transaction log entry. You can read more in the Delta Lake documentation.
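
A quick way to see this is the table history: each single-row INSERT shows up as its own commit/version. A rough check (assuming the table name from the question) might look like:

# Each single-row INSERT becomes its own commit in the Delta transaction log,
# so 100,000 INSERTs would produce 100,000 history entries.
history = spark.sql("DESCRIBE HISTORY db.delta_table_name")
history.select("version", "operation", "timestamp").show(5)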

Instead, it is better to build the whole Spark DataFrame first and then execute a single WRITE operation to insert the data into the Delta table. The example code below finishes in less than a minute.

from pyspark.sql.functions import expr, row_number, lit, to_date
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'

# just create a one record dataframe
df = spark.createDataFrame(rows, columns)

# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))

# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))

# create sequence column 
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))

# add col2 as a day offset to the date (date + int column)
df = df.withColumn("revision_date", df.revision_date + df.col2)

# drop unused column
df = df.drop("col2")

# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')
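
If the target is the managed table from the question rather than a path, a similar sketch (assuming the same table name and a two-column schema of col1 and revision_date) could generate the rows with spark.range and append them in one transaction:

from pyspark.sql.functions import expr, lit

# Build 100,000 rows in parallel; ids 1..100000 act as the day offsets.
df2 = (spark.range(1, 100001)
         .withColumn('col1', lit('Class1'))
         .withColumn('revision_date',
                     expr("date_add(to_date('01/04/2022', 'dd/MM/yyyy'), cast(id as int))"))
         .drop('id'))

# One append = one Delta transaction instead of 100,000.
df2.write.format('delta').mode('append').saveAsTable('db.delta_table_name')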
