
How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive?

I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the DataFrame directly to Hive.

You can create an in-memory temporary table and store it in a Hive table using sqlContext.

Let's say your data frame is myDf. You can create a temporary table using:

myDf.createOrReplaceTempView("mytempTable") 

Then you can use a simple Hive statement to create the table and dump the data from your temp table:

sqlContext.sql("create table mytable as select * from mytempTable");

Use DataFrameWriter.saveAsTable (df.write.saveAsTable(...)). See the Spark SQL and DataFrame Guide.
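A minimal sketch of that call (assuming a Spark 2.x SparkSession named spark; the input path and the target table mydb.mytable are placeholders, and the mydb database must already exist):

// Minimal sketch: persist a DataFrame as a managed table in the Hive metastore.
val df = spark.read.json("/path/to/input.json")   // any DataFrame

df.write
  .mode("overwrite")             // or "append" / "ignore" / "error"
  .saveAsTable("mydb.mytable")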

I don't see df.write.saveAsTable(...) deprecated in the Spark 2.0 documentation. It has worked for us on Amazon EMR. We were perfectly able to read data from S3 into a DataFrame, process it, create a table from the result, and read it with MicroStrategy. Vinay's answer has also worked, though.

You need to have/create a HiveContext:

import org.apache.spark.sql.hive.HiveContext;

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());

Then directly save the DataFrame, or select the columns to store as a Hive table.

df is the DataFrame:

df.write().mode("overwrite").saveAsTable("schemaName.tableName");

or

df.select(df.col("col1"),df.col("col2"), df.col("col3")) .write().mode("overwrite").saveAsTable("schemaName.tableName");

or

df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");

SaveModes are Append / Ignore / Overwrite / ErrorIfExists.
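In Scala, for example, SaveMode lives in org.apache.spark.sql (a small sketch; the table name is a placeholder):

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).saveAsTable("schemaName.tableName")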

I added here the definition of HiveContext from the Spark documentation:

In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build.


On Spark version 1.6.2, using "dbName.tableName" gives this error:

org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables. If the table name has dots (.) in it, please quote the table name with backticks (`).
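A possible workaround on 1.6.x is to fall back to the temp-table + CTAS approach from the first answer (an untested sketch; it assumes sqlContext is a HiveContext and dbName.tableName is the intended target):

df.registerTempTable("mytempTable")   // createOrReplaceTempView in Spark 2.x
sqlContext.sql("create table dbName.tableName as select * from mytempTable")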

Saving to Hive is just a matter of using the write() method of your SQLContext:

df.write.saveAsTable(tableName)

See https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(java.lang.String)

From Spark 2.2: use Dataset instead of DataFrame.
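For example (a rough sketch assuming Spark 2.2+ and a hypothetical case class Person that matches the data; a typed Dataset writes through the same DataFrameWriter API):

case class Person(name: String, age: Int)

import spark.implicits._
val ds = spark.read.parquet("/path/to/people").as[Person]   // Dataset[Person]
ds.write.mode("overwrite").saveAsTable("default.people")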

Sorry for writing late to the post, but I see no accepted answer.

df.write().saveAsTable will throw AnalysisException and is not HIVE table compatible.

Storing the DF as df.write().format("hive") should do the trick!

However, if that doesn't work, then going by the previous comments and answers, this is the best solution in my opinion (open to suggestions, though).

The best approach is to explicitly create the HIVE table (including a PARTITIONED table):

def createHiveTable(): Unit = {
  // hive_table_name, fields, partition_column and StorageType are assumed to be defined elsewhere
  spark.sql(s"CREATE TABLE $hive_table_name($fields) " +
    s"PARTITIONED BY ($partition_column String) STORED AS $StorageType")
}

Save the DF as a temp table:

df.createOrReplaceTempView(s"$tempTableName")

and insert into the PARTITIONED HIVE table:

spark.sql("insert into table default.$hive_table_name PARTITION($partition_column) select * from $tempTableName")
spark.sql("select * from default.$hive_table_name").show(1000,false)

Of course, the LAST COLUMN in the DF will be the PARTITION COLUMN, so create the HIVE table accordingly!

Please comment if it works! Or not.


--UPDATE--

df.write
  .partitionBy(partition_column)
  .format("hive")
  .mode(SaveMode.Append)
  .saveAsTable(new_table_name_to_be_created_in_hive)  // Table should not exist OR should be a PARTITIONED table in HIVE

For Hive external tables I use this function in PySpark:

import re

def save_table(sparkSession, dataframe, database, table_name, save_format="PARQUET"):
    print("Saving result in {}.{}".format(database, table_name))
    output_schema = "," \
        .join(["{} {}".format(x.name.lower(), x.dataType) for x in list(dataframe.schema)]) \
        .replace("StringType", "STRING") \
        .replace("IntegerType", "INT") \
        .replace("DateType", "DATE") \
        .replace("LongType", "INT") \
        .replace("TimestampType", "INT") \
        .replace("BooleanType", "BOOLEAN") \
        .replace("FloatType", "FLOAT")\
        .replace("DoubleType","FLOAT")
    output_schema = re.sub(r'DecimalType[(][0-9]+,[0-9]+[)]', 'FLOAT', output_schema)

    sparkSession.sql("DROP TABLE IF EXISTS {}.{}".format(database, table_name))

    query = "CREATE EXTERNAL TABLE IF NOT EXISTS {}.{} ({}) STORED AS {} LOCATION '/user/hive/{}/{}'" \
        .format(database, table_name, output_schema, save_format, database, table_name)
    sparkSession.sql(query)
    dataframe.write.insertInto('{}.{}'.format(database, table_name),overwrite = True)

You could use the Hortonworks spark-llap library like this:

import com.hortonworks.hwc.HiveWarehouseSession

df.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("table", "myDatabase.myTable")
  .save()

Here is a PySpark version to create a Hive table from a Parquet file. You may have generated Parquet files using an inferred schema and now want to push the definition to the Hive metastore. You can also push the definition to a system like AWS Glue or AWS Athena, not just to the Hive metastore. Here I am using spark.sql to push/create a permanent table.

# Location where my parquet files are present.
df = spark.read.parquet("s3://my-location/data/")

buf = []
buf.append('CREATE EXTERNAL TABLE test123 (')
keyanddatatypes = df.dtypes          # list of (column_name, data_type) tuples
sizeof = len(keyanddatatypes)
print("size----------", sizeof)
count = 1
for eachvalue in keyanddatatypes:
    print(count, sizeof, eachvalue)
    if count == sizeof:              # last column: no trailing comma
        total = str(eachvalue[0]) + ' ' + str(eachvalue[1])
    else:
        total = str(eachvalue[0]) + ' ' + str(eachvalue[1]) + ','
    buf.append(total)
    count = count + 1

buf.append(' )')
buf.append(' STORED as parquet ')
buf.append("LOCATION ")
buf.append("'")
buf.append('s3://my-location/data/')
buf.append("'")
## partition by pt
tabledef = ''.join(buf)

print("---------print definition ---------")
print(tabledef)
## create a table using spark.sql. Assuming you are using spark 2.1+
spark.sql(tabledef)

In my case this works fine:

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("DatabaseName")
df = spark.read.format("csv").option("header", True).load("/user/csvlocation.csv")
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).option("table", <tablename>).save()

Done!!

You can read the data back; say you named the table "Employee":

hive.executeQuery("select * from Employee").show()

For more details use this URL: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive-read-write-operations.html

If you want to create a Hive table (which does not exist) from a DataFrame (sometimes it fails to create with DataFrameWriter.saveAsTable), StructType.toDDL will help in listing the columns as a string.

val df = ...

val schemaStr = df.schema.toDDL // This gives the columns
spark.sql(s"""create table hive_table ( ${schemaStr})""")

//Now write the dataframe to the table
df.write.saveAsTable("hive_table")

hive_table will be created in the default database since we did not provide any database in spark.sql(). stg.hive_table can be used to create hive_table in the stg database.
