
How to parameterize writing a DataFrame into a Hive table

I have a list of tables (across different categories) in an RDBMS that I want to extract and save into Hive, and I want to parameterize this in such a way that I can attach the category name to the output location in Hive. For example, for a category "employee", I want to be able to save a table extracted from the RDBMS in the format "hive_db.employee_some_other_random_name".

I have the code below:

    val category = "employee"
    val tableList = List("schema.table_1", "schema.table_2", "schema.table_3")

    val tableMap = Map("schema.table_1" -> "table_1",
    "schema.table_2" -> "table_2",
    "schema.table_3" -> "table_3")

    val queryMap = Map("table_1" -> "(select * from table_1) tble",
    "table_2" -> "(select * from table_2) tble",
    "table_3" -> "(select * from table_3) tble")

    val tableBucketMap = Map("table_1" -> "bucketBy(80,\"EMPLOY_ID\",\"EMPLOYE_ST\").sortBy(\"EMPLOY_ST\").format(\"parquet\")",
    "table_2" -> "bucketBy(80, \"EMPLOY_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")",
    "table_3" -> "bucketBy(80, \"EMPLOY_ID\", \"SAL_ID\", \"DEPTS_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")")

     for (table <- tableList){
       val tableName = tableMap(table)
       val print_start = "STARTING THE EXTRACTION PROCESSING FOR TABLE: %s"
       val print_statement = print_start.format(tableName)
       println(print_statement)

       val extract_query = queryMap(table)
       val query_statement_non = "Query to extract table %s is: "
       val query_statement = query_statement_non.format(tableName)
       println(query_statement + extract_query)


       val extracted_table = spark.read.format("jdbc")
         .option("url", jdbcURL)
         .option("driver", driver_type)
         .option("dbtable", extract_query)
         .option("user", username)
         .option("password", password)
         .option("fetchsize", "20000")
         .option("queryTimeout", "0")
         .load()

       extracted_table.show(5, false)
       //saving extracted table in hive
       val tableBucket = tableBucketMap(table)
       val output_loc = "hive_db.%s_table_extracted_for_%s"
       val hive_location = output_loc.format(category, tableName)
       println(hive_location)

       val saving_table = "%s.write.%s.saveAsTable(\"%s\")"
       saving_table.format(extracted_table, tableBucket, hive_location)
       println(saving_table.format(extracted_table, tableBucket, hive_location))
  
       val print_end = "COMPLETED EXTRACTION PROCESS FOR TABLE: %s"
       val print_end_statement = print_end.format(tableName)
       println(print_end_statement)
     }

I have the result below for the first table. The same result applies to the other tables:

STARTING THE EXTRACTION PROCESSING FOR TABLE: table_1
Query to extract table table_1 is: (select * from table_1) tble
+---------+--------------------+
|EMPLOY_ID|EMPLOYE_NM          |
+---------+--------------------+
|1        |WELLINGTON          |
|2        |SMITH               |
|3        |CURLEY              |
|4        |PENDRAGON           |
|5        |KEESLER             |
+---------+--------------------+
only showing top 5 rows

hive_db.employee_table_extracted_for_table_1

[EMPLOY_ID: int, EMPLOYE_NM: string].write.bucketBy(80, "EMPLOY_ID", "EMPLOYE_NO").sortBy("EMPLOY_ID").format("parquet").saveAsTable("hive_db.employee_table_extracted_for_table_1")

COMPLETED EXTRACTION PROCESS FOR TABLE: table_1

Instead of writing the extracted DataFrame into Hive, it just printed the column names:

[EMPLOY_ID: int, EMPLOYE_NM: String].write............saveAsTable("hive_db.employee_table_extracted_for_table_1")

How can I make it write the DF into the Hive table?

Can you try this approach? Change your bucket map like this (I have done it for table_1; please do the same for table_2 and table_3):

    val tableBucketMap = Map("table_1" -> "80,\"employe_st\"")

and replace the string-based `df.bucketBy()` with actual arguments matching its signature, `(numBuckets: Int, colName: String, colNames: String*)`:

    // split the "numBuckets,colName" spec and strip the escaped quotes
    val stringArr = tableBucket.split(",")
    val numBuckets = stringArr(0).toInt
    val colName = stringArr(1).replaceAll("\"", "")

    extracted_table.write.mode("append")
      .bucketBy(numBuckets, colName)
      .format("parquet")
      .saveAsTable(hive_location)

This approach will resolve the issue mentioned above, where only the plan string was printed:

[EMPLOY_ID: int, EMPLOYE_NM: String].write............saveAsTable("hive_db.employee_table_extracted_for_table_1")
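The snippet above handles a single bucket column. As a sketch of how the same idea generalizes (the helper name `parseBucketSpec` is hypothetical, not from the original post), the spec strings from the question's `tableBucketMap` can be parsed into the full argument list that `bucketBy` expects, including any additional columns. The parsing runs without Spark; the actual write call is shown commented out:

```scala
// Hypothetical helper (not from the original post): parse a spec such as
// "80,\"EMPLOY_ID\",\"SAL_ID\"" into (numBuckets, first column, remaining columns),
// matching bucketBy(numBuckets: Int, colName: String, colNames: String*).
def parseBucketSpec(spec: String): (Int, String, Seq[String]) = {
  // split on commas, then strip surrounding escaped quotes from each part
  val parts = spec.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\""))
  require(parts.length >= 2, s"expected a bucket count and at least one column: $spec")
  (parts(0).toInt, parts(1), parts.drop(2).toSeq)
}

val (numBuckets, firstCol, restCols) = parseBucketSpec("80,\"EMPLOY_ID\",\"SAL_ID\"")
// numBuckets = 80, firstCol = "EMPLOY_ID", restCols = Seq("SAL_ID")

// With a SparkSession available, the write would then be:
// extracted_table.write.mode("append")
//   .bucketBy(numBuckets, firstCol, restCols: _*)
//   .format("parquet")
//   .saveAsTable(hive_location)
```

Keeping the bucket specs as plain data and building the `DataFrameWriter` call in code avoids the original problem: a string like `"bucketBy(80, ...)"` is never executed, it is only printed.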
