简体   繁体   English

如何从 Spark SQL Query [PySpark] 中获取表名?

[英]How to get the table name from Spark SQL Query [PySpark]?

To get the table name from a SQL Query,要从 SQL 查询中获取表名,

select *
from table1 as t1
full outer join table2 as t2
  on t1.id = t2.id

I found a solution in Scala How to get table names from SQL query?我在 Scala How to get table names from SQL 查询中找到了解决方案?

def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}

which gives me the correct table names when I iterate over the return sequence getTables(query).foreach(println)当我遍历返回序列getTables(query).foreach(println)时,这给了我正确的表名

table1
table2

What would be the equivalent syntax for PySpark? PySpark 的等效语法是什么? The closest I came across was How to extract column name and column type from SQL in pyspark我遇到的最接近的是How to extract column name and column type from SQL in pyspark

plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(query)
print(f"table: {plan.tableDesc().identifier().table()}")

which fails with the traceback回溯失败

Py4JError: An error occurred while calling o78.tableDesc. Trace:
py4j.Py4JException: Method tableDesc([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:835)

I understand, the problem stems up from the fact that I need to filter all plan items which are of type UnresolvedRelation but I cannot find an equivalent notation in python/pyspark我了解,问题源于我需要过滤所有UnresolvedRelation类型的计划项目,但我在 python/pyspark 中找不到等效符号

I have an approach, but rather convoluted.我有一个方法,但相当复杂。 It dumps the Java Object and JSON (a poor man's serialization process), deserializes it to python object, filter and parse the table names It dumps the Java Object and JSON (a poor man's serialization process), deserializes it to python object, filter and parse the table names

import json
def get_tables(query: str):
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            yield plan_item['tableIdentifier']['table']

which yields ['fast_track_gv_nexus', 'buybox_gv_nexus'] when I iterate over the function list(get_tables(query))当我遍历 function list(get_tables(query))时,会产生['fast_track_gv_nexus', 'buybox_gv_nexus']

Note Unfortunately, this breaks for CTE注意不幸的是,这对CTE 不利

Example例子

with delta as (
   select *
    group by id
    cluster by id
 )
select   *
  from ( select  *
         FROM
          (select   *
            from dmm
            inner join delta on dmm.id = delta.id
           )
  )

And to resolve it, I have to hack around through regular expression为了解决它,我必须通过正则表达式来破解

import json
import re
def get_tables(query: str):
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    plan_string = plan.toString()
    cte = re.findall(r"CTE \[(.*?)\]", plan_string)
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            tableIdentifier = plan_item['tableIdentifier']
            table =  plan_item['tableIdentifier']['table']   
            database =  tableIdentifier.get('database', '')
            table_name = "{}.{}".format(database, table) if database else table
            if table_name not in cte:
                yield table_name

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM