
Use of Hierarchical queries in Apache Spark

I am trying to run the SQL query below in Spark using Java:

Dataset<Row> perIDDf = sparkSession.read()
        .format("jdbc")
        .option("url", connection)
        .option("dbtable", "CI_PER_PER")
        .load();

perIDDf.createOrReplaceTempView("CI_PER_PER");

Dataset<Row> perPerDF = sparkSession.sql("select per_id1,per_id2 " +
        "from CI_PER_PER " +
        "start with per_id1='2001822000' " +
        "connect by prior per_id1=per_id2");
perPerDF.show(10, false);

I am getting the error below:

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'with' expecting <EOF>(line 1, pos 45)

== SQL ==
select per_id1,per_id2 from CI_PER_PER start with per_id1='2001822000' connect by prior per_id1=per_id2
---------------------------------------------^^^

        at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
        at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
        at com.tfmwithspark.TestMaterializedView.main(TestMaterializedView.java:127)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Basically, I am trying to use a hierarchical query in Spark.

Is it not supported?

Spark version: 2.3.0

Hierarchical queries are not currently supported in Spark, nor is recursion within a query. WITH is supported, but only in its most limited (non-recursive) form.

You can approximate this, but it is arduous. Here is one approach, though I do not really recommend it: http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/
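Independent of Spark, the core idea behind such approximations is a fixpoint iteration: repeatedly expand the set of reachable rows one level at a time until no new rows appear. A minimal sketch in plain Python (with made-up parent/child edges, not the poster's data):

```python
def descendants(edges, root):
    """Return root plus every node reachable from it via (parent, child) edges."""
    result = {root}
    frontier = {root}
    while frontier:
        # expand one level: children of every node in the current frontier
        frontier = {child for parent, child in edges if parent in frontier}
        # drop anything already seen, so cyclic data cannot loop forever
        frontier -= result
        result |= frontier
    return result

edges = [("A", "B"), ("A", "C"), ("B", "D"), ("X", "Y")]
print(sorted(descendants(edges, "A")))  # prints ['A', 'B', 'C', 'D']
```

This is what CONNECT BY computes for you in Oracle; in Spark you have to drive the iteration yourself, with each "level" being a join.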

A PR for this has already been raised; see that PR.

A workaround you can use is below:

# seed the traversal with the ROOT row(s)
parent_query = """
SELECT asset_id as parent_id FROM {0}.{1}
WHERE name = 'ROOT'
""".format(db_name, table_name)

parent_df = spark.sql(parent_query)
final_df = parent_df

# all (parent, child) edges in the table
child_query = """
SELECT parent_id as parent_to_drop, asset_id
FROM {0}.{1}
""".format(db_name, table_name)

child_df = spark.sql(child_query)

count = 1
while count > 0:
    # join the current level of parents to their children, then promote
    # the children to be the parents for the next iteration
    join_df = child_df.join(parent_df, (child_df.parent_to_drop == parent_df.parent_id)) \
        .drop("parent_to_drop") \
        .drop("parent_id") \
        .withColumnRenamed("asset_id", "parent_id")
    count = join_df.count()          # 0 once no deeper level exists
    final_df = final_df.union(join_df)
    parent_df = join_df

print("----------final-----------")
print(final_df.count())
final_df.show()
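For reference, the same loop can be traced with plain Python sets standing in for the DataFrames (the asset ids below are hypothetical, not the data from the original post):

```python
# (parent_id, asset_id) rows, standing in for child_df; ids are made up
child_rows = [(0, 1), (0, 2), (1, 3), (2, 4)]

parent_ids = {0}             # rows of parent_df: the ROOT asset
final_ids = set(parent_ids)  # accumulates the union, like final_df

count = 1
while count > 0:
    # "join": keep children whose parent is in the current parent set,
    # then treat those children as the parents for the next round
    join_ids = {asset for parent, asset in child_rows if parent in parent_ids}
    count = len(join_ids)
    final_ids |= join_ids    # like final_df.union(join_df)
    parent_ids = join_ids

print(sorted(final_ids))  # prints [0, 1, 2, 3, 4]
```

Note that, like the Spark loop above, this version has no visited-set guard: if the parent/child data contains a cycle, `count` never reaches 0 and the loop never terminates, so only use it on data that is known to be a tree or DAG.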

data: (sample parent/child table shown as an image in the original post)

result:
----------final-----------

8

+---------+
|parent_id|
+---------+
|        0|
|        1|
|        5|
|        2|
|        7|
|        4|
|        3|
|        6|
+---------+
