How can I refresh a Hive/Impala table from Spark Structured Streaming?
Currently my Spark Structured Streaming job looks like this (only the sink part is shown):
import org.apache.spark.sql.streaming.Trigger

// Output aggregation query to Parquet in append mode
aggregationQuery.writeStream
  .format("parquet")
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .partitionBy("date", "hour")
  .option("path", "hdfs://<myip>:8020/user/myuser/spark/proyecto3")
  .option("checkpointLocation", "hdfs://<myip>:8020/user/myuser/spark/checkpointfolder3")
  .outputMode("append")
  .start()
The above code generates .parquet files in the directory defined by path.
I have externally defined an Impala table that reads from that path, but I need the table to be updated or refreshed after every append of Parquet files.
How can this be achieved?
You need to update the partitions of your table after the file sink writes each batch:
import spark.sql

// The table is partitioned by (date, hour), so each new partition has to be
// registered with both keys, pointing at the corresponding leaf directory
val query = "ALTER TABLE proyecto3 ADD IF NOT EXISTS PARTITION (date='20200803', hour='104700') LOCATION '/your/location/proyecto3/date=20200803/hour=104700'"
sql(query)