Use of Hierarchical queries in Apache SPARK

Question

I am trying to run the following SQL query in Spark using Java:

Dataset<Row> perIDDf = sparkSession.read()
        .format("jdbc")
        .option("url", connection)
        .option("dbtable", "CI_PER_PER")
        .load();

perIDDf.createOrReplaceTempView("CI_PER_PER");
Dataset<Row> perPerDF = sparkSession.sql("select per_id1,per_id2 " +
        "from CI_PER_PER " +
        "start with per_id1='2001822000' " +
        "connect by prior per_id1=per_id2");
perPerDF.show(10, false);

I am getting the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'with' expecting <EOF>(line 1, pos 45)

== SQL ==
select per_id1,per_id2 from CI_PER_PER start with per_id1='2001822000' connect by prior per_id1=per_id2
---------------------------------------------^^^

        at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
        at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
        at com.tfmwithspark.TestMaterializedView.main(TestMaterializedView.java:127)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Basically, I am trying to use a hierarchical query in Spark.

Is it not supported?

Spark version: 2.3.0

Answer 1

Hierarchical queries are not currently supported in Spark, and neither is recursion within a query. WITH is supported, but only in its most limited, non-recursive form.

You can approximate this, but it is arduous. Here is one approach, though I do not really recommend it: http://sqlandhadoop.com/how-to-implement-recursive-queries-in-spark/
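To make clear what "recursion in the query" would have to compute, here is a minimal sketch of the question's CONNECT BY query rewritten as a standard recursive WITH clause. It runs against an in-memory SQLite database purely to illustrate the semantics; it is not Spark code, and per the answer above this recursive form is not something Spark accepts. The sample rows are hypothetical, as is the root id "A" standing in for '2001822000'.

```python
import sqlite3

# Hypothetical sample data: (per_id1, per_id2), where per_id2 points at the parent row.
sample_rows = [("A", None), ("B", "A"), ("C", "A"), ("D", "B"), ("X", "Y")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CI_PER_PER (per_id1 TEXT, per_id2 TEXT)")
conn.executemany("INSERT INTO CI_PER_PER VALUES (?, ?)", sample_rows)

# START WITH per_id1 = <root>        -> the non-recursive (anchor) SELECT
# CONNECT BY PRIOR per_id1 = per_id2 -> join each row's per_id2 to the
#                                       per_id1 of a row already in the result
query = """
WITH RECURSIVE hierarchy(per_id1, per_id2) AS (
    SELECT per_id1, per_id2 FROM CI_PER_PER WHERE per_id1 = ?
    UNION ALL
    SELECT c.per_id1, c.per_id2
    FROM CI_PER_PER c
    JOIN hierarchy p ON c.per_id2 = p.per_id1
)
SELECT per_id1, per_id2 FROM hierarchy
"""
result = conn.execute(query, ("A",)).fetchall()
for row in result:
    print(row)
```

The unrelated row ("X", "Y") is never reached from the root, so it does not appear in the result. Any approximation in Spark has to reproduce this "anchor, then repeat the join until nothing new appears" behaviour by hand.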

Answer 2

A PR for this has already been raised; check this.

As a workaround, you can do the following:

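# Assumes `spark`, `db_name` and `table_name` are already defined.

# Seed: the root asset(s) form the first level of the hierarchy.
parent_query = """
SELECT asset_id AS parent_id FROM {0}.{1}
WHERE name = 'ROOT'
""".format(db_name, table_name)

parent_df = spark.sql(parent_query)
final_df = parent_df

# All (parent, child) pairs; parent_id is aliased so it can be dropped after the join.
child_query = """
SELECT parent_id AS parent_to_drop, asset_id
FROM {0}.{1}
""".format(db_name, table_name)

child_df = spark.sql(child_query)

# Walk down one level per iteration until a level has no children.
# Note: this loop does not terminate if the data contains a cycle.
count = 1
while count > 0:
    # Children of the current level become the parents of the next level.
    join_df = child_df.join(parent_df, child_df.parent_to_drop == parent_df.parent_id) \
        .drop("parent_to_drop") \
        .drop("parent_id") \
        .withColumnRenamed("asset_id", "parent_id")
    count = join_df.count()
    final_df = final_df.union(join_df)
    parent_df = join_df

print("----------final-----------")
print(final_df.count())
final_df.show()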

data :

result :
----------final-----------

8

+---------+
|parent_id|
+---------+
|        0|
|        1|
|        5|
|        2|
|        7|
|        4|
|        3|
|        6|
+---------+
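The loop in this workaround can be hard to follow through the DataFrame calls, so here is a plain-Python model of the same level-by-level walk. It is only a sketch: the rows are hypothetical (they are not the answer's data), the function name collect_hierarchy is made up for illustration, and asset_id is assumed to be unique. It also adds a `seen` set, which the Spark version does not have, so that the walk stops even when the data contains a cycle.

```python
def collect_hierarchy(rows, root_name="ROOT"):
    """Return asset ids reachable from the root, one level at a time."""
    # Seed: ids of the root rows (plays the role of parent_df).
    frontier = [r["asset_id"] for r in rows if r["name"] == root_name]
    seen = set(frontier)
    result = list(frontier)  # plays the role of final_df

    while frontier:  # stop once a level contributes nothing new
        parents = set(frontier)
        # "Join" every row to the current level on parent_id, skipping ids
        # already collected so that a cycle cannot loop forever.
        frontier = [
            r["asset_id"]
            for r in rows
            if r["parent_id"] in parents and r["asset_id"] not in seen
        ]
        seen.update(frontier)
        result.extend(frontier)
    return result


# Hypothetical data with a deliberate cycle: the root's parent is asset 3.
rows = [
    {"asset_id": 0, "parent_id": 3, "name": "ROOT"},
    {"asset_id": 1, "parent_id": 0, "name": "a"},
    {"asset_id": 2, "parent_id": 0, "name": "b"},
    {"asset_id": 3, "parent_id": 1, "name": "c"},
]
print(collect_hierarchy(rows))
```

In the Spark version a comparable guard could probably be added by excluding already-collected ids from join_df before the union, for example with a left_anti join against final_df, though that is untested here.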

2 answers

solution1
2 ACCEPTED 2019-02-13 11:44:25

solution2
1 2020-03-02 14:49:41
