简体   繁体   中英

Inserting Records To Delta Table Through Databricks

I wanted to insert 100,000 records into a delta table using databricks. I am trying to insert data by using a simple for loop, something like -

revision_date = '01/04/2022'
for i in range( 0 , 100,000):
    spark.sql(""" insert into db.delta_table_name values ( 'Class1' , '{revision_date}' + i """)

The problem is, it takes awfully long to insert data using insert statement in databricks. It took almost 5+ hours to complete this. Can anyone suggest an alternative or a solution for this problem in databricks.

My Cluster configuration is - 168 GB, 24 core, DBR 9.1 LTS,Spark 3.1.2

The loop through enormous INSERT operations on Delta Table costs a lot because it involves a new Transaction Logging for every single INSERT command. May read more on the doc .

Instead, it would be better to create a whole Spark dataframe first and then execute just one WRITE operation to insert data into Delta Table. The example code below will do in less than a minute.

from pyspark.sql.functions import expr, row_number, lit, to_date, date_add
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'

# just create a one record dataframe
df = spark.createDataFrame(rows, columns)

# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))

# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))

# create sequence column 
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))

# use + operation to add date
df = df.withColumn("revision_date", df.revision_date + df.col2)

# drop unused column
df = df.drop("col2")

# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM