How to get the table names from a Spark SQL query [PySpark]?
I want to get the table names from a SQL query such as
select *
from table1 as t1
full outer join table2 as t2
on t1.id = t2.id
I found a Scala solution in How to get table names from SQL query?
def getTables(query: String): Seq[String] = {
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
When I iterate over the returned sequence with getTables(query).foreach(println), it prints the correct table names:
table1
table2
What is the equivalent in PySpark? The closest I have come across is How to extract column name and column type from SQL in pyspark:
plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(query)
print(f"table: {plan.tableDesc().identifier().table()}")
This fails with the following traceback (presumably because tableDesc() exists only on CREATE TABLE plan nodes, not on the plan a plain SELECT parses to):
Py4JError: An error occurred while calling o78.tableDesc. Trace:
py4j.Py4JException: Method tableDesc([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:835)
I understand that the problem stems from my need to filter out all plan items of type UnresolvedRelation, but I cannot find the equivalent in Python/PySpark.
I have a workaround, but it is rather convoluted: it dumps the Java object to JSON (a poor man's serialization), deserializes it into Python objects, then filters out and parses the table names:
import json

def get_tables(query: str):
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            yield plan_item['tableIdentifier']['table']
When I consume the generator with list(get_tables(query)), it yields ['fast_track_gv_nexus', 'buybox_gv_nexus'].
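The JSON-filtering step can be exercised without a live Spark session by factoring it out into a pure function; here is a minimal sketch, where sample is a hand-built (hypothetical) fragment mimicking the shape of json.loads(plan.toJSON()) output:

```python
def extract_tables(plan_items):
    """Yield table names from a list of plan-node dicts (the shape plan.toJSON() emits)."""
    for item in plan_items:
        if item.get('class') == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            ident = item['tableIdentifier']
            database = ident.get('database', '')
            yield "{}.{}".format(database, ident['table']) if database else ident['table']

# Hand-built stand-in for json.loads(plan.toJSON()); node shapes are assumptions
sample = [
    {'class': 'org.apache.spark.sql.catalyst.plans.logical.Project'},
    {'class': 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation',
     'tableIdentifier': {'table': 'fast_track_gv_nexus'}},
    {'class': 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation',
     'tableIdentifier': {'table': 'buybox_gv_nexus'}},
]
print(list(extract_tables(sample)))  # ['fast_track_gv_nexus', 'buybox_gv_nexus']
```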
Note: unfortunately, this breaks for CTEs. Example:
with delta as (
    select *
    group by id
    cluster by id
)
select *
from (
    select *
    from (
        select *
        from dmm
        inner join delta on dmm.id = delta.id
    )
)
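For this query, the CTE alias delta surfaces in the unanalyzed plan as an UnresolvedRelation just like the real table dmm, so the JSON filter above picks it up too. A hand-built (hypothetical) plan fragment illustrates the failure mode without Spark:

```python
def extract_tables(plan_items):
    # Same filter as the workaround above: keep every UnresolvedRelation node
    for item in plan_items:
        if item.get('class') == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            yield item['tableIdentifier']['table']

# Assumed rough shape of toJSON() for the CTE query: both the real table dmm
# and the CTE alias delta appear as UnresolvedRelation before analysis
cte_plan_items = [
    {'class': 'org.apache.spark.sql.catalyst.plans.logical.With'},
    {'class': 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation',
     'tableIdentifier': {'table': 'dmm'}},
    {'class': 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation',
     'tableIdentifier': {'table': 'delta'}},
]
print(list(extract_tables(cte_plan_items)))  # ['dmm', 'delta'] -- 'delta' is not a real table
```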
To work around it, I had to resort to a regex hack:
import json
import re

def get_tables(query: str):
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    plan_string = plan.toString()
    # The unresolved plan prints CTE definitions as "CTE [name1, name2]";
    # split the bracket contents so each CTE name is matched individually
    cte = [name.strip()
           for group in re.findall(r"CTE \[(.*?)\]", plan_string)
           for name in group.split(',')]
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            table_identifier = plan_item['tableIdentifier']
            table = table_identifier['table']
            database = table_identifier.get('database', '')
            table_name = "{}.{}".format(database, table) if database else table
            if table_name not in cte:
                yield table_name
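The regex step itself can be checked against a hand-written plan string; the string below is a hypothetical stand-in matching the "CTE [...]" shape an unresolved plan prints:

```python
import re

def cte_names(plan_string):
    # Each With node prints its definitions as "CTE [name1, name2]";
    # split the bracket contents so every CTE name is captured individually
    return [name.strip()
            for group in re.findall(r"CTE \[(.*?)\]", plan_string)
            for name in group.split(',')]

# Hand-written stand-in for plan.toString() on the CTE query above
plan_string = "CTE [delta]\n:  +- 'SubqueryAlias delta\n+- 'Project [*]"
ctes = cte_names(plan_string)
tables = ['dmm', 'delta']
print([t for t in tables if t not in ctes])  # ['dmm']
```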