
Writing to a Delta table using Spark SQL

In Python I am trying to create and write to the table TBL in the database DB in Databricks, but I get an exception: "A schema mismatch detected when writing to the Delta table". My code is as follows, where df is a pandas DataFrame.

from pyspark.sql import SparkSession

DB = "database_name"  # placeholder names
TMP_TBL = "temporary_table"
TBL = "table_name"

sesh = SparkSession.builder.getOrCreate()
df_spark = sesh.createDataFrame(df)
df_spark.createOrReplaceTempView(TMP_TBL)

create_db_query = f"""
    CREATE DATABASE IF NOT EXISTS {DB}
    COMMENT "This is a database"
    LOCATION "/tmp/{DB}"
    """

create_table_query = f"""
    CREATE TABLE IF NOT EXISTS {DB}.{TBL}
    USING DELTA
    TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
    COMMENT "This is a table"
    LOCATION "/tmp/{DB}/{TBL}";
    """

insert_query = f"""
    INSERT INTO TABLE {DB}.{TBL} select * from {TMP_TBL}
    """

sesh.sql(create_db_query)
sesh.sql(create_table_query)
sesh.sql(insert_query)

The code fails at the last line, the insert_query statement. When I check, the database and table have been created, but the table is of course empty. So the problem seems to be that TMP_TBL and TBL have different schemas; how and where do I define the schema so that they match?

If the schema of your table is different from the schema of the data frame you are inserting, you will get this error. Make sure the two schemas are the same before performing the insert operation; a quick way to compare them is shown below, followed by an approach that works.
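A minimal sketch, reusing the DB, TBL, and df_spark names from the question (spark is the session object, called sesh in the question's code), of how the two schemas can be compared before running the INSERT:

# Print both schemas; any difference between them is what triggers the
# schema mismatch error on INSERT.
spark.table(f"{DB}.{TBL}").printSchema()
df_spark.printSchema()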

I reproduced the same scenario in my environment with the following and got the output shown below.

ddl_query = """CREATE TABLE if not exists test123.emp_file 
                   USING DELTA
                   LOCATION 'dbfs:/user/dem1231'
                   """
spark.sql(ddl_query)

insert_query = """
    INSERT INTO TABLE test123.emp_file SELECT * FROM temp_table
    """
spark.sql(insert_query)

(screenshot of the output)
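If the table does not yet exist at the target location, one way to avoid the mismatch in the first place is to declare the columns explicitly in the DDL so they match the temp view. The following is only a sketch with hypothetical column names (borrowed from the sample data frame below), assuming the location is empty or already holds data with exactly this schema:

# Declare the table schema explicitly; the columns must match temp_table.
ddl_with_schema = """CREATE TABLE IF NOT EXISTS test123.emp_file (
                         firstname STRING,
                         id STRING,
                         gender STRING,
                         salary INT
                     )
                     USING DELTA
                     LOCATION 'dbfs:/user/dem1231'
                  """
spark.sql(ddl_with_schema)
spark.sql("INSERT INTO test123.emp_file SELECT * FROM temp_table")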

Or

Try this alternative approach to insert data into the table.

I have a data frame like this, with a predefined schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample data frame with a predefined schema
data = [
    ("vamsi", "1", "M", 2000),
    ("saideep", "2", "M", 3000),
    ("rakesh", "3", "M", 4000)
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)

Then, using the write command with append mode, you can insert the data frame directly into the SQL table:

df.write.mode("append").format("delta").saveAsTable("DB.TBL")

(screenshots of the output)
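As a side note, if the target table already exists and the data frame only adds new columns, Delta's schema evolution can be enabled on the append; whether that is desirable depends on your use case. A sketch:

# Optional: let Delta add new columns from the data frame to the existing
# table schema during the append (schema evolution).
df.write.mode("append").option("mergeSchema", "true").format("delta").saveAsTable("DB.TBL")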
