
How to write a function that runs certain SQL on certain columns in a PySpark dataframe?

[screenshot: each dataframe column listed alongside the SQL query to run on it]

I wrote some code and have this as output. The left side is basically the columns of a dataframe that I'm working with, and the right side is the SQL query that needs to be run on that particular column.

Now I want to write a function that runs the queries on the right against the columns on the left and displays the output.

The first picture shows the values of the 'Column' and 'Query' columns of another dataframe; I used the .collect() method to retrieve those values.
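For reference, here is a minimal sketch of what that retrieval step might look like, assuming an active SparkSession named spark; the dataframe name meta_sdf and its contents are illustrative, not from the original post:

# hypothetical metadata dataframe holding the 'Column' and 'Query' values
meta_sdf = spark.createDataFrame(
    [('maxpulse', 'select count(*) from {table} where {col} is null')],
    ['Column', 'Query']
)

# .collect() returns a list of Row objects; unpack into plain (column, query) tuples
pairs = [(r['Column'], r['Query']) for r in meta_sdf.select('Column', 'Query').collect()]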


This seemed like a simple problem, but I'm still stuck on it. Any idea how to do it?

You can put the column names and queries into a dictionary:

dct = {'column_name': 'SELECT * FROM table WHERE {col} IS NULL'}

for k, v in dct.items():
    q = v.format(col = k)
    # spark.sql(q)
    print(q)

Output:

SELECT * FROM table WHERE column_name IS NULL
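The same pattern scales to several columns; a short sketch, where the table name mytable and the second query are assumptions for illustration:

dct = {
    'maxpulse': 'SELECT COUNT(*) FROM mytable WHERE {col} IS NULL',
    'duration': 'SELECT COUNT(DISTINCT {col}) FROM mytable'
}

for k, v in dct.items():
    q = v.format(col=k)
    # spark.sql(q).show()  # uncomment to execute and display each result
    print(q)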

Using a subset of your data

from pyspark.sql import functions as func
from pyspark.sql.types import StringType

data_ls = [
    ('maxpulse', 'select count(*) from {table} where {col} is null'),
    ('duration', 'select round((count(distinct {col}) / count({col})) * 100) from {table}')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['column', 'query'])

# +--------+-----------------------------------------------------------------------+
# |column  |query                                                                  |
# +--------+-----------------------------------------------------------------------+
# |maxpulse|select count(*) from {table} where {col} is null                       |
# |duration|select round((count(distinct {col}) / count({col})) * 100) from {table}|
# +--------+-----------------------------------------------------------------------+

Approach 1: Using a UDF

# fill in whichever placeholders have values; leave the others intact
def createQuery(query_string, column_name=None, table_name=None):
    if column_name is not None and table_name is None:
        fnlquery = query_string.format(col=column_name, table='{table}')
    elif column_name is None and table_name is not None:
        fnlquery = query_string.format(col='{col}', table=table_name)
    elif column_name is not None and table_name is not None:
        fnlquery = query_string.format(col=column_name, table=table_name)
    else:
        fnlquery = query_string

    return fnlquery

# register the function as a UDF so it can be applied to dataframe columns
createQueryUDF = func.udf(createQuery, StringType())

# apply the UDF to the 'query' and 'column' fields; '{table}' is left intact
data_sdf. \
    withColumn('final_query', createQueryUDF('query', 'column')). \
    select('final_query'). \
    show(truncate=False)

# +-----------------------------------------------------------------------------+
# |final_query                                                                  |
# +-----------------------------------------------------------------------------+
# |select count(*) from {table} where maxpulse is null                          |
# |select round((count(distinct duration) / count(duration)) * 100) from {table}|
# +-----------------------------------------------------------------------------+
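The UDF's table_name argument can fill in '{table}' as well; because a constant is passed rather than a dataframe column, it needs to be wrapped in func.lit(). A sketch, where the table name mytable is an assumption:

data_sdf. \
    withColumn('final_query', createQueryUDF('query', 'column', func.lit('mytable'))). \
    select('final_query'). \
    show(truncate=False)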

Approach 2: Using the regexp_replace() SQL function

data_sdf. \
    withColumn('final_query', func.expr(r'regexp_replace(query, "[\{]col[\}]", column)')). \
    select('final_query'). \
    show(truncate=False)

# +-----------------------------------------------------------------------------+
# |final_query                                                                  |
# +-----------------------------------------------------------------------------+
# |select count(*) from {table} where maxpulse is null                          |
# |select round((count(distinct duration) / count(duration)) * 100) from {table}|
# +-----------------------------------------------------------------------------+

A similar approach can be used to replace '{table}' with a table name, as sketched below. The final queries from the final_query field can then be collected (using .collect()) and used to run the SQL queries.
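A possible sketch of that table substitution, chaining a second regexp_replace() call; the table name mytable is an assumption:

data_sdf. \
    withColumn('final_query', func.expr(r'regexp_replace(query, "[\{]col[\}]", column)')). \
    withColumn('final_query', func.expr(r'regexp_replace(final_query, "[\{]table[\}]", "mytable")')). \
    select('final_query'). \
    show(truncate=False)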

query_list = data_sdf. \
    withColumn('final_query', func.expr(r'regexp_replace(query, "[\{]col[\}]", column)')). \
    select('final_query'). \
    rdd.map(lambda x: x.final_query). \
    collect()
# ['select count(*) from {table} where maxpulse is null',
#  'select round((count(distinct duration) / count(duration)) * 100) from {table}']

# run the queries by iterating over the list
# note: '{table}' is still a placeholder at this point and must be filled in
# with a real table name first ('mytable' below is an assumption)
for query in query_list:
    spark.sql(query.format(table='mytable')).show()
