How to select columns using dynamic select query using window function
I have a sample input dataframe as shown below, but the number of value columns (the columns whose names start with m) can be any number n.
customer_id|month_id|m1 |m2 |m3 .......m_n
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 02 |40 |50
1002 | 02 |50 |60
1003 | 02 |60 |70
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
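Now I have to create new columns holding the cumulative sum of each value column, accumulated per customer across months, so I used a window function. Because there will be n such columns, I would rather not call withColumn in a for loop; instead I need to build a query or a list dynamically and pass it to select or selectExpr to compute the new columns.
For example: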
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
df = df.select("*", F.sum(F.col("m1")).over(rownum_window).alias("n1"))
However, I want to prepare the expressions dynamically and then pass them to the dataframe select. How can I do that?
Something like: expr = ['F.sum(col("m1")).over(rownum_window).alias("n1")', 'F.sum(col("m2")).over(rownum_window).alias("n2")', 'F.sum(col("m3")).over(rownum_window).alias("n3")', .......]
df = df.select("*", expr)
Or is there any other way to build the select expressions for the dataframe select?
Output:
customer_id|month_id|m1 |m2 |n1 |n2
1001 | 01 |10 |20 |10 |20
1002 | 01 |20 |30 |20 |30
1003 | 01 |30 |40 |30 |40
1001 | 02 |40 |50 |50 |70
1002 | 02 |50 |60 |70 |90
1003 | 02 |60 |70 |90 |110
1001 | 03 |70 |80 |120 |150
1002 | 03 |80 |90 |150 |180
1003 | 03 |90 |100 |180 |210
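Since the question mentions selectExpr, here is a minimal sketch of building the SQL expression strings dynamically. The helper name cumulative_sum_sql and the m<number> to n<number> naming rule are assumptions made for illustration; only the string building is shown, and the result is meant to be passed to selectExpr on an existing DataFrame.

```python
def cumulative_sum_sql(columns, partition_col="customer_id", order_col="month_id"):
    """Build one Spark SQL window expression per value column (m1 -> n1, ...)."""
    window = (
        f"PARTITION BY {partition_col} ORDER BY {order_col} "
        "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
    )
    return [
        f"sum({c}) OVER ({window}) AS n{c[1:]}"
        for c in columns
        # Keep only names of the form m<digits>; this also skips month_id.
        if c.startswith("m") and c[1:].isdigit()
    ]


exprs = cumulative_sum_sql(["customer_id", "month_id", "m1", "m2"])
for e in exprs:
    print(e)

# With a DataFrame in hand (not created here):
#   df = df.selectExpr("*", *exprs)
```

In practice the column list would come from df.columns, so the same call covers any number of m columns.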
Update:
import pyspark.sql.functions as F
from pyspark.sql import Window
rownum_window = Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)
expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2")]
df.select('*', *expr) \
.orderBy('month_id', 'customer_id') \
.show(10, False)
+-----------+--------+---+---+---+---+
|customer_id|month_id|m1 |m2 |n1 |n2 |
+-----------+--------+---+---+---+---+
|1001 |1 |10 |20 |10 |20 |
|1002 |1 |20 |30 |20 |30 |
|1003 |1 |30 |40 |30 |40 |
|1001 |2 |40 |50 |50 |70 |
|1002 |2 |50 |60 |70 |90 |
|1003 |2 |60 |70 |90 |110|
|1001 |3 |70 |80 |120|150|
|1002 |3 |80 |90 |150|180|
|1003 |3 |90 |100|180|210|
+-----------+--------+---+---+---+---+
Try this.
expr = [F.sum(F.col("m1")).over(rownum_window).alias("n1"), F.sum(F.col("m2")).over(rownum_window).alias("n2"), ...]
df = df.select('*', *expr)
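The list above can also be generated from the DataFrame's own column names instead of being typed out. The following is a sketch under stated assumptions: the helper names value_columns and cumulative_sum_exprs are hypothetical, value columns are assumed to be named m followed by digits, and the outputs are aliased n followed by the same digits.

```python
import re


def value_columns(columns):
    """Return the columns named m<number>, e.g. m1, m2 (month_id is skipped)."""
    return [c for c in columns if re.fullmatch(r"m\d+", c)]


def cumulative_sum_exprs(df, partition_col="customer_id", order_col="month_id"):
    """One windowed running-sum Column per m<number> column, aliased n<number>."""
    # Imported here so value_columns() can be used without a Spark install.
    import pyspark.sql.functions as F
    from pyspark.sql import Window

    w = (
        Window.partitionBy(partition_col)
        .orderBy(order_col)
        .rangeBetween(Window.unboundedPreceding, Window.currentRow)
    )
    return [
        F.sum(F.col(c)).over(w).alias("n" + c[1:])
        for c in value_columns(df.columns)
    ]


# Usage, given an existing DataFrame df:
#   df = df.select("*", *cumulative_sum_exprs(df))
```

The star in front of the list matters: it unpacks the Column objects into separate arguments after "*".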
With a slight modification to @Lamanus's suggestion, the following code may help solve your problem:
# pyspark --driver-memory 1G --executor-memory 2G --executor-cores 1 --num-executors 1
from pyspark.sql import Row
from pyspark.sql.functions import *
from pyspark.sql.window import Window
drow = Row("customer_id","month_id","m1","m2","m3","m4")
data=[drow("1001","01","10","20","10","20"),drow("1002","01","20","30","20","30"),drow("1003","01","30","40","30","40"),drow("1001","02","40","50","40","50"),drow("1002","02","50","60","50","60"),drow("1003","02","60","70","60","70"),drow("1001","03","70","80","70","80"),drow("1002","03","80","90","80","90"),drow("1003","03","90","100","90","100")]
df = spark.createDataFrame(data)
df.show()
'''
+-----------+--------+---+---+---+---+
|customer_id|month_id| m1| m2| m3| m4|
+-----------+--------+---+---+---+---+
| 1001| 01| 10| 20| 10| 20|
| 1002| 01| 20| 30| 20| 30|
| 1003| 01| 30| 40| 30| 40|
| 1001| 02| 40| 50| 40| 50|
| 1002| 02| 50| 60| 50| 60|
| 1003| 02| 60| 70| 60| 70|
| 1001| 03| 70| 80| 70| 80|
| 1002| 03| 80| 90| 80| 90|
| 1003| 03| 90|100| 90|100|
+-----------+--------+---+---+---+---+
'''
a = ["m1","m2"]  # columns to accumulate with sum
b = ["m3","m4"]  # columns to accumulate with avg
rownum_window = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
# Build the select list from a and b; sum and avg here are the pyspark.sql.functions versions (star import above)
expr = ["*"] + [sum(col(c)).over(rownum_window).alias("sum%d" % i) for i, c in enumerate(a, 1)] + [avg(col(c)).over(rownum_window).alias("avg%d" % i) for i, c in enumerate(b, 1)]
df.select(expr).show()
'''
+-----------+--------+---+---+---+---+-----+-----+----+----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|avg1|avg2|
+-----------+--------+---+---+---+---+-----+-----+----+----+
| 1003| 01| 30| 40| 30| 40| 30.0| 40.0|30.0|40.0|
| 1003| 02| 60| 70| 60| 70| 90.0|110.0|45.0|55.0|
| 1003| 03| 90|100| 90|100|180.0|210.0|60.0|70.0|
| 1002| 01| 20| 30| 20| 30| 20.0| 30.0|20.0|30.0|
| 1002| 02| 50| 60| 50| 60| 70.0| 90.0|35.0|45.0|
| 1002| 03| 80| 90| 80| 90|150.0|180.0|50.0|60.0|
| 1001| 01| 10| 20| 10| 20| 10.0| 20.0|10.0|20.0|
| 1001| 02| 40| 50| 40| 50| 50.0| 70.0|25.0|35.0|
| 1001| 03| 70| 80| 70| 80|120.0|150.0|40.0|50.0|
+-----------+--------+---+---+---+---+-----+-----+----+----+
'''
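As a sanity check on the table above, the running sum and running average for a single customer can be reproduced in plain Python. This is only a worked example of the arithmetic, not part of the Spark job; the values are those of customer 1001, where m1 and m3 happen to hold the same numbers. The Spark output shows 120.0 rather than 120, presumably because the sample rows were created as strings.

```python
from itertools import accumulate

# Values for customer 1001 in month order 01, 02, 03 (m1 and m3 are identical).
values = [10, 40, 70]

# Running sum: what sum(...).over(window) yields for each row of the partition.
running_sum = list(accumulate(values))

# Running average: the running sum divided by the number of rows seen so far.
running_avg = [total / n for n, total in enumerate(running_sum, start=1)]

print(running_sum)  # [10, 50, 120]
print(running_avg)  # [10.0, 25.0, 40.0]
```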