
How should I pass a Spark SQL DataFrame as an argument in a Python function?

I have a Spark SQL DataFrame in a Jupyter notebook named output_df1. I want to define a function as follows:

def output_agg(output_table_1):
    output_agg_1 = spark.sql(f"""
    select * from {output_table_1}
    """)
    return output_agg_1

When I call output_agg(output_df1), I get the following error:

Py4JJavaError                             Traceback (most recent call last)
/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o110.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '['  expecting <EOF>

Can you please help with the correct syntax?

Print the SQL query before passing it to spark.sql and check whether it looks correct. Also, please share the generated SQL query in the question.

def output_agg(output_table_1):
    query = f"""select * from {output_table_1}"""
    print(query)
    output_agg_1 = spark.sql(query)
    return output_agg_1

If the SQL query looks correct, then a likely issue is that the table is not registered with Spark.
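Printing the query makes the root cause visible: an f-string interpolates the object's string representation, so passing a DataFrame (rather than a table name) embeds something like `DataFrame[...]` into the SQL text. A minimal sketch, using a stand-in class (no Spark needed) to mimic a DataFrame's repr:

```python
# FakeDataFrame is a stand-in whose repr resembles a real PySpark DataFrame's.
class FakeDataFrame:
    def __repr__(self):
        return "DataFrame[id: bigint, name: string]"

output_df1 = FakeDataFrame()

# Interpolating the object embeds its repr, not a table name.
query = f"select * from {output_df1}"
print(query)  # select * from DataFrame[id: bigint, name: string]
```

The `[` in the interpolated repr is exactly what triggers the parser's "mismatched input '['" error.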

As discussed in the comments, since you want to union multiple DataFrames, you can do something like this:

from functools import reduce
from pyspark.sql import DataFrame

dfs_list = [output_df1, output_df2, output_df3, output_df4]
df_combined = reduce(DataFrame.unionAll, dfs_list)

NOTE: Make sure the columns in all the DataFrames are in the same order.
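The `reduce` call folds the binary `DataFrame.unionAll` pairwise across the list. The same fold pattern can be seen with plain Python lists (no Spark needed):

```python
from functools import reduce

# reduce applies the binary operation pairwise:
# ((parts[0] + parts[1]) + parts[2]) ...
parts = [[1, 2], [3], [4, 5]]
combined = reduce(lambda a, b: a + b, parts)
print(combined)  # [1, 2, 3, 4, 5]
```

If the column order may differ between DataFrames, `DataFrame.unionByName` (available since Spark 2.3) matches columns by name instead of by position.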

spark.sql expects the name of a table or temporary view, not a DataFrame object. Register the DataFrame as a temporary view first, and then execute the SQL statement:

output_df1.createOrReplaceTempView('output_table')
def output_agg(output_table_1):
    output_agg_1 = spark.sql(f"""
    select * from {output_table_1}
    """)
    return output_agg_1
output_agg('output_table')
