Reading Hive view created with CTE (WITH clause) from Spark
I have a Hive view created with a CTE (WITH clause) that unions two tables and then keeps only the latest record for each id. In my environment I have a tool for browsing the Hive databases (DBeaver, since non-datalake developers need to browse the data).
The view code:
CREATE VIEW IF NOT EXISTS db.test_cte_view AS
with cte as (select * from db.test_cte union select * from db.test_cte_2),
tmp as (SELECT id, idate, ROW_NUMBER() over(PARTITION BY id ORDER BY idate desc ) AS row_num from cte)
SELECT cte.* from cte
join (SELECT * from tmp where tmp.row_num =1) tmp_2
on cte.id = tmp_2.id
and cte.idate = tmp_2.idate
The problem is:
When the view is created with Hive through beeline (this is the main way we create tables and views in Hive), I can browse it with DBeaver just fine. But when a Spark process reads from it, it fails with the following:
##pyspark
spark.sql("select * from db.test_cte_view").show()
'Table or view not found: cte; line 3 pos 56'
Traceback (most recent call last):
File "DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/pyspark.zip/pyspark/sql/session.py", line 545, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Table or view not found: cte; line 3 pos 56'
When the view is created with spark.sql("CREATE VIEW ..."), I can read it just fine with:
##pyspark
spark.sql("select * from db.test_cte_view").show()
But when trying to browse it with DBeaver, it fails with something like:
Query execution failed
Reason:
SQL Error [40000] [42000]: Error while compiling statement: FAILED: SemanticException line 1:330 Failed to recognize predicate 'UNION'. Failed rule: 'identifier' in subquery source in definition of VIEW test_cte_view [
SELECT `gen_attr_0` AS `id`, `gen_attr_1` AS `status`, `gen_attr_2` AS `idate` FROM (SELECT `gen_attr_0`, `gen_attr_1`, `gen_attr_2` FROM ((SELECT `gen_attr_0`, `gen_attr_1`, `gen_attr_2` FROM (SELECT `id` AS `gen_attr_0`, `status` AS `gen_attr_1`, `idate` AS `gen_attr_2` FROM `db`.`test_cte`) AS gen_subquery_0) UNION DISTINCT (SELECT `gen_attr_5`, `gen_attr_6`, `gen_attr_7` FROM (SELECT `id` AS `gen_attr_5`, `status` AS `gen_attr_6`, `idate` AS `gen_attr_7` FROM `db`.`test_cte_2`) AS gen_subquery_1)) AS cte INNER JOIN (SELECT `gen_attr_3`, `gen_attr_4`, `gen_attr_8` FROM (SELECT `gen_attr_3`, `gen_attr_4`, `gen_attr_8` FROM (SELECT gen_subquery_4.`gen_attr_3`, gen_subquery_4.`gen_attr_4`, row_number() OVER (PARTITION BY `gen_attr_3` ORDER BY `gen_attr_4` DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `gen_attr_8` FROM (SELECT `gen_attr_3`, `gen_attr_4` FROM ((SELECT `gen_attr_3`, `gen_attr_9`, `gen_attr_4` FROM (SELECT `id` AS `gen_attr_3`, `status` AS `gen_attr_9`, `idate` AS `gen_attr_4` FROM `db`.`test_cte`) AS gen_subquery_2) UNION DISTINCT (SELECT `gen_attr_5`, `gen_attr_6`, `gen_attr_7` FROM (SELECT `id` AS `gen_attr_5`, `status` AS `gen_attr_6`, `idate` AS `gen_attr_7` FROM `db`.`test_cte_2`) AS gen_subquery_3)) AS cte) AS gen_subquery_4) AS gen_subquery_5) AS tmp WHERE (`gen_attr_8` = 1)) AS tmp_2 ON ((`gen_attr_0` = `gen_attr_3`) AND (`gen_attr_2` = `gen_attr_4`))) AS cte
] used as test_cte_view at Line 1:14
It seems the view text that gets generated differs between the two ways of creating the view.
Is there a way to make the first scenario work (view created through beeline, then read through Spark SQL)?
Thank you.
Spark: 2.1.1, Hive: 1.2.1
The tables:
CREATE TABLE db.test_cte(
id int,
status string,
idate date )
CREATE TABLE db.test_cte_2(
id int,
status string,
idate date )
Populated with:
insert into db.test_cte values
(1,"green","2019-03-08"),
(2,"green","2019-03-08"),
(3,"green","2019-03-08"),
(1,"red","2019-03-09"),
(1,"yellow","2019-03-10"),
(2,"gray","2019-03-09")
insert into db.test_cte_2 values
(10,"green","2019-03-08"),
(20,"green","2019-03-08"),
(30,"green","2019-03-08"),
(10,"red","2019-03-09"),
(10,"yellow","2019-03-10"),
(20,"gray","2019-03-09")
EDIT:
For anyone interested, I created an issue on the Spark JIRA:
https://issues.apache.org/jira/browse/SPARK-27203
I ran into the same problem on Spark 2.1.1.2.6.1.0-129. Upgrading to Spark 2.4 solved it.
If upgrading is not an option, this workaround worked for me on 2.1:
# Load the Hive view through the DataFrame API and register it as a temp table,
# then query the temp table instead of the view directly.
spark.table("db.my_view_with_ctes").registerTempTable("tmp")
spark.sql("select * from tmp").show()
This runs much longer than reading the view through spark-sql on Spark 2.4 (more than 10x the runtime for my use case), but it works.
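Another option, which I have not verified on these exact versions but which follows from the error above (Spark fails to resolve the cte name when it re-parses the view text stored by Hive), is to define the view from beeline without the WITH clause, inlining the CTEs as subqueries. The view name test_cte_view_nocte and the exact layout below are only a sketch:
-- Hedged sketch: same logic as db.test_cte_view but without a WITH clause,
-- so Spark has no CTE name to resolve when reading the view.
-- test_cte_view_nocte is a hypothetical name.
CREATE VIEW IF NOT EXISTS db.test_cte_view_nocte AS
SELECT u.*
FROM (SELECT * FROM db.test_cte UNION SELECT * FROM db.test_cte_2) u
JOIN (
  SELECT id, idate
  FROM (
    SELECT id, idate,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY idate DESC) AS row_num
    FROM (SELECT * FROM db.test_cte UNION SELECT * FROM db.test_cte_2) v
  ) ranked
  WHERE row_num = 1
) latest
ON u.id = latest.id
AND u.idate = latest.idate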