
How to parameterize writing a DataFrame into a Hive table

I have a list of tables (across different categories) in an RDBMS that I want to extract and save into Hive, and I want to parameterize this in such a way that I can attach the category name to the output location in Hive. For example, for a category "employee", I want to be able to save a table extracted from the RDBMS in the format "hive_db.employee_some_other_random_name".

I have the code below:

    val category = "employee"
    val tableList = List("schema.table_1", "schema.table_2", "schema.table_3")

    val tableMap = Map("schema.table_1" -> "table_1",
    "schema.table_2" -> "table_2",
    "schema.table_3" -> "table_3")

    val queryMap = Map("table_1" -> "(select * from table_1) tble",
    "table_2" -> "(select * from table_2) tble",
    "table_3" -> "(select * from table_3) tble")

    val tableBucketMap = Map("table_1" -> "bucketBy(80,\"EMPLOY_ID\",\"EMPLOYE_ST\").sortBy(\"EMPLOY_ST\").format(\"parquet\")",
    "table_2" -> "bucketBy(80, \"EMPLOY_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")",
    "table_3" -> "bucketBy(80, \"EMPLOY_ID\", \"SAL_ID\", \"DEPTS_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")")

     for (table <- tableList){
       val tableName = tableMap(table)
       val print_start = "STARTING THE EXTRACTION PROCESSING FOR TABLE: %s"
       val print_statement = print_start.format(tableName)
       println(print_statement)

       val extract_query = queryMap(table)
       val query_statement_non = "Query to extract table %s is: "
       val query_statement = query_statement_non.format(tableName)
       println(query_statement + extract_query)


       val extracted_table = spark.read.format("jdbc")
         .option("url", jdbcURL)
         .option("driver", driver_type)
         .option("dbtable", extract_query)
         .option("user", username)
         .option("password", password)
         .option("fetchsize", "20000")
         .option("queryTimeout", "0")
         .load()

       extracted_table.show(5, false)
       //saving extracted table in hive
       val tableBucket = tableBucketMap(table)
       val output_loc = "hive_db.%s_table_extracted_for_%s"
       val hive_location = output_loc.format(category, tableName)
       println(hive_location)

       val saving_table = "%s.write.%s.saveAsTable(\"%s\")"
       saving_table.format(extracted_table, tableBucket, hive_location)
       println(saving_table.format(extracted_table, tableBucket, hive_location))
  
       val print_end = "COMPLETED EXTRACTION PROCESS FOR TABLE: %s"
       val print_end_statement = print_end.format(tableName)
       println(print_end_statement)
     }

I have the result below for the first table. The same result applies to the other tables:

STARTING THE EXTRACTION PROCESSING FOR TABLE: table_1
Query to extract table table_1 is: (select * from table_1) tble
+---------+--------------------+
|EMPLOY_ID|EMPLOYE_NM          |
+---------+--------------------+
|1        |WELLINGTON          |
|2        |SMITH               |
|3        |CURLEY              |
|4        |PENDRAGON           |
|5        |KEESLER             |
+---------+--------------------+
only showing top 5 rows

hive_db.employee_table_extracted_for_table_1

[EMPLOY_ID: int, EMPLOYE_NM: string].write.bucketBy(80, "EMPLOY_ID", "EMPLOYE_NO").sortBy("EMPLOY_ID").format("parquet").saveAsTable("hive_db.employee_table_extracted_for_table_1")

COMPLETED EXTRACTION PROCESS FOR TABLE: table_1

Instead of writing the extracted DataFrame into Hive, it just printed the column names:

[EMPLOY_ID: int, EMPLOYE_NM: String].write............saveAsTable("hive_db.employee_table_extracted_for_table_1")

How can I make it write the DF into the Hive table?

Can you try this approach? Change your bucket map like this (I have done it for table_1; please do the same for table_2 and table_3):

    val tableBucketMap = Map("table_1" -> "80,\"employe_st\"")

and replace the string-based `df.bucketBy()` with actual arguments matching its signature, `(numBuckets: Int, colName: String, colNames: String*)`:

    // split the "numBuckets,colName" spec and strip the escaped quotes
    val stringArr = tableBucket.split(",")
    val numBuckets = stringArr(0).toInt
    val colName = stringArr(1).replaceAll("\"", "")

    extracted_table.write.mode("append")
      .bucketBy(numBuckets, colName)
      .format("parquet")
      .saveAsTable(hive_location)

This approach will resolve the issue mentioned above, where only the plan string was printed:

[EMPLOY_ID: int, EMPLOYE_NM: String].write............saveAsTable("hive_db.employee_table_extracted_for_table_1")
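The snippet above handles a single bucket column. As a sketch of how the same idea generalizes (the helper name `parseBucketSpec` is hypothetical, not from the original post), the spec strings from the question's `tableBucketMap` can be parsed into the full argument list that `bucketBy` expects, including any additional columns. The parsing runs without Spark; the actual write call is shown commented out:

```scala
// Hypothetical helper (not from the original post): parse a spec such as
// "80,\"EMPLOY_ID\",\"SAL_ID\"" into (numBuckets, first column, remaining columns),
// matching bucketBy(numBuckets: Int, colName: String, colNames: String*).
def parseBucketSpec(spec: String): (Int, String, Seq[String]) = {
  // split on commas, then strip surrounding escaped quotes from each part
  val parts = spec.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\""))
  require(parts.length >= 2, s"expected a bucket count and at least one column: $spec")
  (parts(0).toInt, parts(1), parts.drop(2).toSeq)
}

val (numBuckets, firstCol, restCols) = parseBucketSpec("80,\"EMPLOY_ID\",\"SAL_ID\"")
// numBuckets = 80, firstCol = "EMPLOY_ID", restCols = Seq("SAL_ID")

// With a SparkSession available, the write would then be:
// extracted_table.write.mode("append")
//   .bucketBy(numBuckets, firstCol, restCols: _*)
//   .format("parquet")
//   .saveAsTable(hive_location)
```

Keeping the bucket specs as plain data and building the `DataFrameWriter` call in code avoids the original problem: a string like `"bucketBy(80, ...)"` is never executed, it is only printed.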
