How can I refresh a Hive/Impala table from Spark Structured Streaming?
Currently my Spark Structured Streaming job looks like this (only the sink part is shown):
import org.apache.spark.sql.streaming.Trigger

// Output aggregation query to Parquet in append mode
aggregationQuery.writeStream
  .format("parquet")
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .partitionBy("date", "hour")
  .option("path", "hdfs://<myip>:8020/user/myuser/spark/proyecto3")
  .option("checkpointLocation", "hdfs://<myip>:8020/user/myuser/spark/checkpointfolder3")
  .outputMode("append")
  .start()
The above code generates .parquet files in the directory defined by path.
I have externally defined an Impala table that reads from that path, but I need the table to be updated or refreshed after every append of Parquet files.
How can this be achieved?
You need to update the partitions of your table after the file sink writes each batch:
import spark.sql

// The table is partitioned by (date, hour), so each new partition has to be
// registered with both keys, pointing at the corresponding leaf directory
val query = "ALTER TABLE proyecto3 ADD IF NOT EXISTS PARTITION (date='20200803', hour='104700') LOCATION '/your/location/proyecto3/date=20200803/hour=104700'"
sql(query)