How to create a Hive table from a Spark data frame, using its schema?

I want to create a Hive table using my Spark DataFrame's schema. How can I do that?

For fixed columns, I can use:

val CreateTable_query = "Create Table my_table(a string, b string, c double)"
sparksession.sql(CreateTable_query) 

But I have many columns in my DataFrame, so is there a way to automatically generate such a query?

Assuming you are using Spark 2.1.0 or later and my_DF is your DataFrame:

//imports needed by the snippet below
import java.util.Arrays;
import java.util.stream.Collectors;
import org.apache.spark.sql.types.StructType;

//get the schema as a comma-separated string of "field datatype" pairs
StructType my_schema = my_DF.schema();
String columns = Arrays.stream(my_schema.fields())
                       .map(field -> field.name() + " " + field.dataType().typeName())
                       .collect(Collectors.joining(","));

//drop the table if already created
spark.sql("drop table if exists my_table");
//create the table using the dataframe schema
spark.sql("create table my_table(" + columns + ") "
    + "row format delimited fields terminated by '|' location '/my/hdfs/location'");

//write the dataframe data to the hdfs location backing the created Hive table
my_DF.write()
     .format("com.databricks.spark.csv")
     .option("delimiter", "|")
     .mode("overwrite")
     .save("/my/hdfs/location");

The other method uses a temp table:

my_DF.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists my_table");
spark.sql("create table my_table as select * from my_temp_table");

As per your question, it looks like you want to create a table in Hive using your DataFrame's schema. But since you say that DataFrame has many columns, there are two options:

  • 1st: create a Hive table directly through the DataFrame.
  • 2nd: take the schema of this DataFrame and use it to create the table in Hive.

Consider this code:

package hive.example

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object checkDFSchema extends App {
  val cc = new SparkConf()
  val sc = new SparkContext(cc)
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  //First option for creating hive table through dataframe 
  val DF = sparkSession.sql("select * from salary")
  DF.createOrReplaceTempView("tempTable")
  sparkSession.sql("Create table yourtable as select * form tempTable")
  //Second option for creating hive table from schema
  val oldDFF = sparkSession.sql("select * from salary")
  //Generate the schema out of dataframe  
  val schema = oldDFF.schema
  //Generate RDD of you data 
  val rowRDD = sc.parallelize(Seq(Row(100, "a", 123)))
  //Creating new DF from data and schema 
  val newDFwithSchema = sparkSession.createDataFrame(rowRDD, schema)
  newDFwithSchema.createOrReplaceTempView("tempTable")
  sparkSession.sql("create table FinalTable AS select * from tempTable")
}

Another way is to use the methods available on StructType: sql, simpleString, treeString, etc.

You can create DDL from a DataFrame's schema, and you can create a DataFrame's schema from your DDL.

Here is one example (up to Spark 2.3):

    // Setup Sample Test Table to create Dataframe from
    spark.sql(""" drop database hive_test cascade""")
    spark.sql(""" create database hive_test""")
    spark.sql("use hive_test")
    spark.sql("""CREATE TABLE hive_test.department(
    department_id int ,
    department_name string
    )    
    """)
    spark.sql("""
    INSERT INTO hive_test.department values ("101","Oncology")    
    """)

    spark.sql("SELECT * FROM hive_test.department").show()

/***************************************************************/

Now I have a DataFrame to play with. In real cases you'd use DataFrame readers to create the DataFrame from files/databases. Let's use its schema to create DDL.

    // Create DDL from the Spark DataFrame schema using the simpleString function.
    // simpleString yields something like "struct<department_id:int,department_name:string>".

    // Regex to strip the unwanted characters
    val sqlrgx = """(struct<)|(>)|(:)""".r

    // Build the DDL column list by removing those characters
    val sqlString = sqlrgx.replaceAllIn(spark.table("hive_test.department").schema.simpleString, " ")

    // Create the table with sqlString
    spark.sql(s"create table hive_test.department2( $sqlString )")

From Spark 2.4 onwards you can use the fromDDL & toDDL methods on StructType:

val fddl = """
      department_id int ,
      department_name string,
      business_unit string
      """


    // Easily create StructType from DDL String using fromDDL
    val schema3: StructType = org.apache.spark.sql.types.StructType.fromDDL(fddl)


    // Create DDL String from StructType using toDDL
    val tddl = schema3.toDDL

    spark.sql(s"drop table if exists hive_test.department2 purge")

   // Create Table using string tddl
    spark.sql(s"""create table hive_test.department2 ( $tddl )""")

    // Test by inserting sample rows and selecting
    spark.sql("""
    INSERT INTO hive_test.department2 values ("101","Oncology","MDACC Texas")    
    """)
    spark.table("hive_test.department2").show()
    spark.sql(s"drop table hive_test.department2")

From Spark 2.4 onwards you can use this function to get the column names and types (even for nested structs):

val df = spark.read....

df.schema.toDDL
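
The resulting DDL string can then be dropped straight into a CREATE TABLE statement (a minimal sketch; the table name and STORED AS clause here are placeholders):

// toDDL yields something like "id INT,name STRING" (names may be backtick-quoted)
val ddl = df.schema.toDDL
spark.sql(s"CREATE TABLE IF NOT EXISTS my_table ( $ddl ) STORED AS PARQUET")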

Here is a PySpark version to create a Hive table from a Parquet file. You may have generated Parquet files using an inferred schema and now want to push the definition to the Hive metastore. You can also push the definition to a system like AWS Glue or AWS Athena, not just to the Hive metastore. Here I am using spark.sql to push/create the permanent table.

    # Location where my parquet files are present.
    df = spark.read.parquet("s3://my-location/data/")

    buf = []
    buf.append('CREATE EXTERNAL TABLE test123 (')
    keyanddatatypes = df.dtypes
    sizeof = len(df.dtypes)
    print("size----------", sizeof)
    count = 1
    for eachvalue in keyanddatatypes:
        print(count, sizeof, eachvalue)
        # the last column gets no trailing comma
        if count == sizeof:
            total = str(eachvalue[0]) + ' ' + str(eachvalue[1])
        else:
            total = str(eachvalue[0]) + ' ' + str(eachvalue[1]) + ','
        buf.append(total)
        count = count + 1

    buf.append(' )')
    buf.append(' STORED as parquet ')
    buf.append("LOCATION ")
    buf.append("'s3://my-location/data/'")
    ## partition by pt
    tabledef = ''.join(buf)

    print("---------print definition ---------")
    print(tabledef)
    ## create a table using spark.sql. Assuming you are using spark 2.1+
    spark.sql(tabledef)
