
Spark Dataframe With for Loop: Optimization Technique

I am trying to implement the logic below.

    1. Take some records from one table.
    2. Loop over the resulting data.
    3. Inside the loop, read data from other tables into two different dataframes.
    4. Join these two dataframes and load the result into a third table.

    val id_chck1 = "select distinct id, id1, id2 from table WHERE type = 'N'"
    val id_chck = hive.executeQuery(id_chck1)
    for (data <- id_chck) {
      val id = data(0)
      val id1 = data(1)
      val id2 = data(2)

      val values_1 = "select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
      val values_1_data = hive.executeQuery(values_1)
      for (row <- values_1_data.collect) {
        // Split the row once instead of rebuilding the string for every field
        val fields = row.mkString(",").split(",")
        val bill = fields(0)
        val bil_num = fields(1)
        val id_num = fields(2)
        val bill_date = fields(3)

        val df1 = "select column name from tablename where id=222"
        val df1_data = hive.executeQuery(df1)
        val df2 = "select column name from tablename2 where id=222"
        val df2_data = hive.executeQuery(df2)

        // Placeholder: the actual join of df1 and df2 is not shown in the question
        val df3 = df1_data.join(df2_data)
        df3.write.format("orc").mode("Append").save("hdfslocation")
      }
      val load1 = "load data inpath 'hdfslocation' into table tablename"
      val load1_data = hive.executeUpdate(load1)
    }

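But this process takes more than 6 hours. Is there any other way to do the same thing so that it completes in less time, for example using RDDs or setting some Spark/Hive properties to improve performance? I have 5,00,000 (500,000) records in the test1 table.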
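One common way to avoid this kind of nested loop is to express the per-row lookups as joins, so that Spark runs a single job and writes once instead of issuing several queries and an append for every row. The following is only a minimal sketch under stated assumptions: the sample rows, the columns `col_a` and `col_b`, the join keys (`id2` between the id table and the bill table, `id` for the two lookup tables), and the helper `joinAll` are all hypothetical and not taken from the question. In the real job the four inputs would presumably come from `hive.executeQuery(...)` rather than from in-memory data.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LoopAsJoin {

  // Hypothetical helper: replaces the two nested loops with joins.
  // Assumed keys: ids/bills match on id2, lookup tables match on id.
  def joinAll(ids: DataFrame, bills: DataFrame, t1: DataFrame, t2: DataFrame): DataFrame = {
    // Same filter as the first query in the question
    val wanted = ids.filter(col("type") === "N").select("id", "id1", "id2").distinct()
    // Same projection as the second query, plus the assumed join key id2
    val wantedBills = bills
      .select("id2", "bill", "bil_num", "id_num", "bill_date", "process_date")
      .distinct()

    wanted
      .join(wantedBills, Seq("id2")) // replaces the inner loop over bills
      .join(t1, Seq("id"))           // replaces the per-row query on tablename
      .join(t2, Seq("id"))           // replaces the per-row query on tablename2
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("loop-as-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the Hive tables
    val ids = Seq((1, 10, 222, "N"), (2, 20, 333, "Y")).toDF("id", "id1", "id2", "type")
    val bills = Seq((222, "b1", "n1", "x1", "2024-01-01", "2024-01-02"))
      .toDF("id2", "bill", "bil_num", "id_num", "bill_date", "process_date")
    val t1 = Seq((1, "a")).toDF("id", "col_a")
    val t2 = Seq((1, "b")).toDF("id", "col_b")

    val result = joinAll(ids, bills, t1, t2)
    result.show()

    // One write for the whole result instead of one append per loop iteration.
    // A temporary directory stands in for the HDFS location here.
    val out = Files.createTempDirectory("loop-as-join").resolve("out").toString
    result.write.format("orc").mode("append").save(out)

    spark.stop()
  }
}
```

Whether this matches the intended result depends on how the tables are actually related, which the question does not show, so the join keys would need to be adjusted to the real schema.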

Could you add the input and the expected output as an example? It is hard to see exactly what you are trying to achieve.
