
How can I refresh a Hive/Impala table from Spark Structured Streaming?

Currently my Spark Structured Streaming job goes like this (sink part displayed only):

    //Output aggregation query to Parquet in append mode
    aggregationQuery.writeStream
      .format("parquet")
      .trigger(Trigger.ProcessingTime("15 seconds"))
      .partitionBy("date", "hour")
      .option("path", "hdfs://<myip>:8020/user/myuser/spark/proyecto3")
      .option("checkpointLocation", "hdfs://<myip>:8020/user/myuser/spark/checkpointfolder3")
      .outputMode("append")
      .start()

The above code generates .parquet files in the directory defined by path.

I have externally defined an Impala table that reads from that path, but I need the table to be updated or refreshed after every append of Parquet files.

How can this be achieved?

You need to update the partitions of your table after the file sink writes them.

    import spark.sql

    // Both partition columns (date AND hour) must appear in the partition
    // spec, because the table is partitioned by ("date", "hour");
    // adding a partition on hour alone would be rejected
    val addPartition =
      "ALTER TABLE proyecto3 ADD IF NOT EXISTS " +
      "PARTITION (date='20200803', hour='104700') " +
      "LOCATION '/your/location/proyecto3/date=20200803/hour=104700'"
    sql(addPartition)

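Even with the metastore up to date, Impala caches table metadata, so Impala itself still needs a REFRESH proyecto3 (or INVALIDATE METADATA for a brand-new table) before the new files show up in queries. Below is a minimal sketch of issuing it over JDBC; the connection string and port 21050 are assumptions that depend on which Impala JDBC driver your cluster uses:

    import java.sql.DriverManager

    // Assumes an Impala JDBC driver is on the classpath; 21050 is the
    // default impalad port for the HiveServer2 protocol
    val conn = DriverManager.getConnection("jdbc:impala://<myip>:21050")
    try {
      val stmt = conn.createStatement()
      stmt.execute("REFRESH proyecto3") // make new files/partitions visible
      stmt.close()
    } finally {
      conn.close()
    }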