
Inserting Records To Delta Table Through Databricks

I want to insert 100,000 records into a Delta table using Databricks. I am trying to insert the data with a simple for loop, something like this:

revision_date = '01/04/2022'
for i in range(0, 100000):
    spark.sql(f"""insert into db.delta_table_name values ('Class1', date_add(to_date('{revision_date}', 'dd/MM/yyyy'), {i}))""")

The problem is that inserting data with INSERT statements in Databricks takes awfully long; it took more than 5 hours to complete. Can anyone suggest an alternative or a solution to this problem in Databricks?

My cluster configuration is 168 GB, 24 cores, DBR 9.1 LTS, Spark 3.1.2.

Looping over an enormous number of INSERT operations on a Delta table is expensive because every single INSERT command creates a new transaction log entry. You can read more in the Delta Lake documentation.
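
A quick way to see this is the table history: each single-row INSERT shows up as its own commit/version. A rough check (assuming the table name from the question) might look like:

# Each single-row INSERT becomes its own commit in the Delta transaction log,
# so 100,000 INSERTs would produce 100,000 history entries.
history = spark.sql("DESCRIBE HISTORY db.delta_table_name")
history.select("version", "operation", "timestamp").show(5)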

Instead, it is better to build the whole Spark DataFrame first and then execute a single WRITE operation to insert the data into the Delta table. The example code below finishes in less than a minute.

from pyspark.sql.functions import expr, row_number, lit, to_date
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'

# just create a one record dataframe
df = spark.createDataFrame(rows, columns)

# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))

# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))

# create sequence column 
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))

# add col2 as a day offset to the date (date + int column)
df = df.withColumn("revision_date", df.revision_date + df.col2)

# drop unused column
df = df.drop("col2")

# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')
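
If the target is the managed table from the question rather than a path, a similar sketch (assuming the same table name and a two-column schema of col1 and revision_date) could generate the rows with spark.range and append them in one transaction:

from pyspark.sql.functions import expr, lit

# Build 100,000 rows in parallel; ids 1..100000 act as the day offsets.
df2 = (spark.range(1, 100001)
         .withColumn('col1', lit('Class1'))
         .withColumn('revision_date',
                     expr("date_add(to_date('01/04/2022', 'dd/MM/yyyy'), cast(id as int))"))
         .drop('id'))

# One append = one Delta transaction instead of 100,000.
df2.write.format('delta').mode('append').saveAsTable('db.delta_table_name')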
