Append pandas dataframe to existing table in databricks
I want to append a pandas dataframe (8 columns) to an existing table in Databricks (12 columns), and fill the other 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It throws the error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
It looks like Spark can't handle this operation with unmatched columns. Is there any way to achieve what I want?
I think the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two:
from pyspark.sql import Row
from pyspark.sql.functions import lit

bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])

smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])

# add the missing column as a null literal so both schemas line up
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
Can you try this:
from pyspark.sql import functions as F

df = spark.createDataFrame(pandas_df)

# empty frame that carries the target table's 12-column schema
df_table_struct = sqlContext.sql('select * from my_table limit 0')

# add every table column missing from the dataframe as a null literal
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))

df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
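An alternative is to pad the pandas DataFrame itself before handing it to Spark: pandas' reindex adds the unmatched columns filled with NaN, which Spark converts to null. A sketch with hypothetical column names (c0..c11 stand in for the table's 12 columns):

```python
import pandas as pd

# Hypothetical schemas: the target table has 12 columns c0..c11,
# while the incoming frame only has the first 8.
table_cols = [f"c{i}" for i in range(12)]
df = pd.DataFrame({f"c{i}": [1, 2] for i in range(8)})

# reindex keeps the table's column order and null-fills the 4 missing columns
padded = df.reindex(columns=table_cols)

print(list(padded.columns))        # the full 12-column layout, in table order
print(padded["c11"].isna().all())  # True: padded columns contain only missing values
```

After `spark.createDataFrame(padded)` the frame already matches the table's 12 columns, so a plain append should no longer hit the schema mismatch.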