
Creating an Athena view on a HUDI table returns soft deleted records when the view is read using SPARK

I have multiple HUDI tables with differing column names, and I built a view on top of them to standardize the column names. When this view is queried from Athena, it returns the correct result. But when the same view is read with SPARK using spark.read.parquet("<>"), it also returns the soft deleted records. I understand that a HUDI table needs to be read with spark.read.format("hudi"), but since this is a view on top of it, I have to use spark.read.parquet(""). Is there a way to force HUDI to retain only the latest commit in the table and suppress all the old commits?

An Athena view is a virtual table stored in the Glue metastore. The best way to get the same result in Spark as in Athena is to use AWS Glue as the metastore/catalog for your Spark session. To do that, you can use this lib, which allows you to use AWS Glue as a Hive metastore; then you can read the view using spark.read.table("<database name>.<view name>") or via an SQL query:

val df = spark.sql("SELECT * FROM <database name>.<view name>")
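For reference, a minimal sketch of a Spark session wired to the Glue Data Catalog. It assumes the Glue catalog client is already on the classpath (as it is on EMR) and that the session has AWS credentials; the application name is made up for illustration, and the placeholders are left exactly as in the answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the AWS Glue Data Catalog client for Hive metastore is on the classpath.
val spark = SparkSession.builder()
  .appName("read-athena-view") // hypothetical name
  // Point the Hive metastore client at Glue instead of a Hive metastore service.
  .config(
    "spark.hadoop.hive.metastore.client.factory.class",
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()

// Resolve the view through the catalog rather than by file path.
val viewDf = spark.read.table("<database name>.<view name>")
```

On a managed cluster the same setting is often supplied through the cluster configuration instead of in code, so the builder call may not be needed at all.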

Try to avoid spark.read.parquet(""), because it doesn't use the Hudi metadata at all: it reads every Parquet file under the path, including older file versions that Hudi's timeline would otherwise filter out. If you have issues with Glue, you can use Hive to create, for Spark, the same view you created in Athena.
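If neither catalog route is available, one possible workaround (not the Hive view mentioned above, but a Spark temporary view) is to read each table through the Hudi data source and standardize the column names in Spark. This is only a sketch: it assumes the Hudi Spark bundle is on the classpath, and the base path, column names, and view name are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical base path of one Hudi table.
val basePath = "s3://my-bucket/hudi/orders"

// Reading with the Hudi data source consults the table's timeline,
// so the default snapshot query returns only the latest committed
// version of each record instead of every Parquet file on disk.
val hudiDf = spark.read.format("hudi").load(basePath)

// Hypothetical renames that mirror what the Athena view does.
val standardized = hudiDf
  .withColumnRenamed("cust_id", "customer_id")
  .withColumnRenamed("ord_ts", "order_timestamp")

// Expose the result under a stable name for SQL queries in this session.
standardized.createOrReplaceTempView("orders_standardized")
val df = spark.sql("SELECT * FROM orders_standardized")
```

The temporary view lives only for the lifetime of the Spark session, so it has to be recreated in each job that needs it.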


 