How to split pyspark dataframe and create new columns

I have a sample input dataframe as below, but the value columns (the ones whose names start with m) can be n in number. I also use customer_id as the primary key (though there can be more primary-key columns depending on the input data).

customer_id|month_id|m1    |m2 |m3 ....to....m_n
1001      |  01    |10     |20    
1002      |  01    |20     |30    
1003      |  01    |30     |40
1001      |  02    |40     |50    
1002      |  02    |50     |60    
1003      |  02    |60     |70
1001      |  03    |70     |80    
1002      |  03    |80     |90    
1003      |  03    |90     |100

Now, based on the input value columns, I have to calculate new columns using a cumulative sum or a cumulative average. Let's consider an example:

cumulative sum on [m1, ......, m10] and 
cumulative avg on [m11, ......., m20] columns 

Based on this I have to calculate the new columns. I have tried it with window functions and am able to compute them. But because of the size of the data, I'm doing the calculations one after another, each time on the dataframe updated with the new columns.

My attempt:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

a = ["m1", ......, "m10"]
b = ["m11", ......., "m20"]
rnum = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
for item in a:
    var = "n"
    df = df.withColumn(var + item[1:], F.sum(item).over(rnum))
for item in b:
    var = "n"
    df = df.withColumn(var + item[1:], F.avg(item).over(rnum))

Output data:

customer_id|month_id|m1     |m2    |m11     |m12   |n1   |n2  |n11  |n12
1001       |  01    |10     |20    |10      |20    |10   |20  |10   |20
1002       |  01    |20     |30    |10      |20    |20   |30  |10   |20
1003       |  01    |30     |40    |10      |20    |30   |40  |10   |20
1001       |  02    |40     |50    |10      |20    |50   |35  |10   |20
1002       |  02    |50     |60    |10      |20    |70   |55  |10   |20
1003       |  02    |60     |70    |10      |20    |90   |75  |10   |20
1001       |  03    |70     |80    |10      |20    |120  |75  |10   |20
1002       |  03    |80     |90    |10      |20    |150  |105 |10   |20
1003       |  03    |90     |100   |10      |20    |180  |135 |10   |20

But can we do the same operation by splitting the dataframe into two, with the cumulative-sum columns (plus the primary key) in one dataframe and the cumulative-average columns (plus the primary key) in another, performing the calculations separately, and then joining the resulting dataframes back together?

Based on your question, my understanding is that you are trying to split the operation so that the tasks run in parallel and save time.

You don't have to parallelize the execution yourself: Spark parallelizes it automatically whenever you perform an action such as collect(), show(), count(), or write on the dataframe you have created. This is a consequence of Spark's lazy execution.
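
For instance (a minimal sketch, assuming an existing SparkSession named spark; the column name doubled is illustrative), no job runs while the dataframe is being built, and only the action at the end starts the distributed execution:

import pyspark.sql.functions as F

lazy_df = spark.range(10).withColumn("doubled", F.col("id") * 2)  # transformation only, nothing runs yet
lazy_df.count()  # the action triggers the (parallel) execution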

If you still want to split the operations for some other reason, you can use threading; a rough sketch follows. The article below gives more information about threading in PySpark: https://medium.com/@everisUS/threads-in-pyspark-a6e8005f6017
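
As a rough illustration (not taken from the linked article), the sum and avg loops from the code further below can be submitted from two Python threads; run_sums and run_avgs are hypothetical helper names, and df, df1, a, b and rnum are assumed to be defined as in that code:

from concurrent.futures import ThreadPoolExecutor
import pyspark.sql.functions as F

def run_sums(sdf):
    # cumulative sums for the columns listed in a
    for item in a:
        sdf = sdf.withColumn("sum" + item[1:], F.sum(item).over(rnum))
    sdf.count()  # trigger an action so the job actually runs inside this thread
    return sdf

def run_avgs(sdf):
    # cumulative averages for the columns listed in b
    for item in b:
        sdf = sdf.withColumn("avg" + item[1:], F.avg(item).over(rnum))
    sdf.count()
    return sdf

with ThreadPoolExecutor(max_workers=2) as pool:
    f_sum = pool.submit(run_sums, df)
    f_avg = pool.submit(run_avgs, df1)
    sum_df, avg_df = f_sum.result(), f_avg.result()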

DF1 Approach Optimized Logical Plan

== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Project [m1#15, m2#16, sum1#27, sum2#38, customer_id#5334, month_id#5335, m3#5338, m4#5339, avg3#465, avg4#474]
      +- Join Inner, ((customer_id#13 = customer_id#5334) && (month_id#14 = month_id#5335))
         :- Project [customer_id#13, month_id#14, m1#15, m2#16, sum1#27, sum2#38]
         :  +- Filter isnotnull(month_id#14)
         :     +- Window [sum(_w0#39) windowspecdefinition(customer_id#13, month_id#14 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum2#38], [customer_id#13], [month_id#14 ASC NULLS FIRST]
         :        +- Project [customer_id#13, month_id#14, m1#15, m2#16, sum1#27, cast(m2#16 as double) AS _w0#39]
         :           +- Window [sum(_w0#28) windowspecdefinition(customer_id#13, month_id#14 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum1#27], [customer_id#13], [month_id#14 ASC NULLS FIRST]
         :              +- Project [customer_id#13, month_id#14, m1#15, m2#16, cast(m1#15 as double) AS _w0#28]
         :                 +- Filter isnotnull(customer_id#13)
         :                    +- LogicalRDD [customer_id#13, month_id#14, m1#15, m2#16, m3#17, m4#18]
         +- Project [customer_id#5334, month_id#5335, m3#5338, m4#5339, avg3#465, avg4#474]
            +- Filter isnotnull(month_id#5335)
               +- Window [avg(_w0#475) windowspecdefinition(customer_id#5334, month_id#5335 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg4#474], [customer_id#5334], [month_id#5335 ASC NULLS FIRST]
                  +- Project [customer_id#5334, month_id#5335, m3#5338, m4#5339, avg3#465, cast(m4#5339 as double) AS _w0#475]
                     +- Window [avg(_w0#466) windowspecdefinition(customer_id#5334, month_id#5335 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg3#465], [customer_id#5334], [month_id#5335 ASC NULLS FIRST]
                        +- Project [customer_id#5334, month_id#5335, m3#5338, m4#5339, cast(m3#5338 as double) AS _w0#466]
                           +- Filter isnotnull(customer_id#5334)
                              +- LogicalRDD [customer_id#5334, month_id#5335, m1#5336, m2#5337, m3#5338, m4#5339]

DF Approach Optimized Logical Plan

== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, sum1#14, sum2#25, avg3#447, avg4#460]
      +- Window [avg(_w0#461) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg4#460], [customer_id#0], [month_id#1 ASC NULLS FIRST]
         +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, sum1#14, sum2#25, avg3#447, cast(m4#5 as double) AS _w0#461]
            +- Window [avg(_w0#448) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg3#447], [customer_id#0], [month_id#1 ASC NULLS FIRST]
               +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, sum1#14, sum2#25, cast(m3#4 as double) AS _w0#448]
                  +- Window [sum(_w0#26) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum2#25], [customer_id#0], [month_id#1 ASC NULLS FIRST]
                     +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, sum1#14, cast(m2#3 as double) AS _w0#26]
                        +- Window [sum(_w0#15) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum1#14], [customer_id#0], [month_id#1 ASC NULLS FIRST]
                           +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, cast(m1#2 as double) AS _w0#15]
                              +- LogicalRDD [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5]

If you look at the DF Approach Optimized Logical Plan above, it still carries the SUM calculation steps inside the AVG calculation plan, which might be inefficient:

+- Window [sum(_w0#26) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum2#25], [customer_id#0], [month_id#1 ASC NULLS FIRST]
                     +- Project [customer_id#0, month_id#1, m1#2, m2#3, m3#4, m4#5, sum1#14, cast(m2#3 as double) AS _w0#26]
                        +- Window [sum(_w0#15) windowspecdefinition(customer_id#0, month_id#1 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum1#14],  

You can narrow down the dataframe size whenever possible and proceed with the calculations. At the same time, a join plan was added for the two datasets in the DF1 Optimized Logical Plan. Joins are often slow, so it is better to performance-tune your Spark execution environment via:

  • code - repartition & cache (see the sketch after this list)
  • configs - executor, driver, memoryOverhead, number of cores
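
A minimal sketch of those two levers (the memory and core values are placeholders, and your_job.py is a hypothetical script name, not from the original answer):

# code: co-locate each customer's rows and cache the dataframe so the
# sum pass and the avg pass reuse the same materialized data
df = df.repartition("customer_id")
df.cache()
df.count()  # materialize the cache before the window calculations

# configs: passed when submitting the job, for example
# spark-submit --driver-memory 4G --executor-memory 8G --executor-cores 4 \
#   --conf spark.executor.memoryOverhead=2G your_job.py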

Code I have tried with the m1, m2, m3, m4 columns:

# pyspark --driver-memory 1G --executor-memory 2G --executor-cores 1 --num-executors 1
from pyspark.sql import Row
import pyspark.sql.functions as F
from pyspark.sql.window import Window

drow = Row("customer_id","month_id","m1","m2","m3","m4")

data=[drow("1001","01","10","20","10","20"),drow("1002","01","20","30","20","30"),drow("1003","01","30","40","30","40"),drow("1001","02","40","50","40","50"),drow("1002","02","50","60","50","60"),drow("1003","02","60","70","60","70"),drow("1001","03","70","80","70","80"),drow("1002","03","80","90","80","90"),drow("1003","03","90","100","90","100")]

df = spark.createDataFrame(data)

# keep only the key columns plus the avg inputs in a second dataframe
df1=df.select("customer_id","month_id","m3","m4")

a = ["m1","m2"]
b = ["m3","m4"]
rnum = (Window.partitionBy("customer_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
for item in a:
    var = "sum"
    df = df.withColumn(var + item[1:], F.sum(item).over(rnum))
df.show()
'''
+-----------+--------+---+---+---+---+-----+-----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|
+-----------+--------+---+---+---+---+-----+-----+
|       1003|      01| 30| 40| 30| 40| 30.0| 40.0|
|       1003|      02| 60| 70| 60| 70| 90.0|110.0|
|       1003|      03| 90|100| 90|100|180.0|210.0|
|       1002|      01| 20| 30| 20| 30| 20.0| 30.0|
|       1002|      02| 50| 60| 50| 60| 70.0| 90.0|
|       1002|      03| 80| 90| 80| 90|150.0|180.0|
|       1001|      01| 10| 20| 10| 20| 10.0| 20.0|
|       1001|      02| 40| 50| 40| 50| 50.0| 70.0|
|       1001|      03| 70| 80| 70| 80|120.0|150.0|
+-----------+--------+---+---+---+---+-----+-----+
'''
for item in b:
    var = "avg"
    df = df.withColumn(var + item[1:], F.avg(item).over(rnum))
df.show()

'''
+-----------+--------+---+---+---+---+-----+-----+----+----+
|customer_id|month_id| m1| m2| m3| m4| sum1| sum2|avg3|avg4|
+-----------+--------+---+---+---+---+-----+-----+----+----+
|       1003|      01| 30| 40| 30| 40| 30.0| 40.0|30.0|40.0|
|       1003|      02| 60| 70| 60| 70| 90.0|110.0|45.0|55.0|
|       1003|      03| 90|100| 90|100|180.0|210.0|60.0|70.0|
|       1002|      01| 20| 30| 20| 30| 20.0| 30.0|20.0|30.0|
|       1002|      02| 50| 60| 50| 60| 70.0| 90.0|35.0|45.0|
|       1002|      03| 80| 90| 80| 90|150.0|180.0|50.0|60.0|
|       1001|      01| 10| 20| 10| 20| 10.0| 20.0|10.0|20.0|
|       1001|      02| 40| 50| 40| 50| 50.0| 70.0|25.0|35.0|
|       1001|      03| 70| 80| 70| 80|120.0|150.0|40.0|50.0|
+-----------+--------+---+---+---+---+-----+-----+----+----+
'''

for item in b:
    var = "avg"
    df1 = df1.withColumn(var + item[1:], F.avg(item).over(rnum))
df1.show()

'''
+-----------+--------+---+---+----+----+
|customer_id|month_id| m3| m4|avg3|avg4|
+-----------+--------+---+---+----+----+
|       1003|      01| 30| 40|30.0|40.0|
|       1003|      02| 60| 70|45.0|55.0|
|       1003|      03| 90|100|60.0|70.0|
|       1002|      01| 20| 30|20.0|30.0|
|       1002|      02| 50| 60|35.0|45.0|
|       1002|      03| 80| 90|50.0|60.0|
|       1001|      01| 10| 20|10.0|20.0|
|       1001|      02| 40| 50|25.0|35.0|
|       1001|      03| 70| 80|40.0|50.0|
+-----------+--------+---+---+----+----+
'''
#join the DFs after DF1 avg & DF sum calculation.

df2=df.join(df1,(df1.customer_id == df.customer_id)& (df1.month_id == df.month_id)).drop(df.m3).drop(df.m4).drop(df1.month_id).drop(df1.customer_id)

'''
df2.show()
+---+---+-----+-----+-----------+--------+---+---+----+----+
| m1| m2| sum1| sum2|customer_id|month_id| m3| m4|avg3|avg4|
+---+---+-----+-----+-----------+--------+---+---+----+----+
| 10| 20| 10.0| 20.0|       1001|      01| 10| 20|10.0|20.0|
| 70| 80|120.0|150.0|       1001|      03| 70| 80|40.0|50.0|
| 40| 50| 50.0| 70.0|       1001|      02| 40| 50|25.0|35.0|
| 80| 90|150.0|180.0|       1002|      03| 80| 90|50.0|60.0|
| 50| 60| 70.0| 90.0|       1002|      02| 50| 60|35.0|45.0|
| 20| 30| 20.0| 30.0|       1002|      01| 20| 30|20.0|30.0|
| 30| 40| 30.0| 40.0|       1003|      01| 30| 40|30.0|40.0|
| 90|100|180.0|210.0|       1003|      03| 90|100|60.0|70.0|
| 60| 70| 90.0|110.0|       1003|      02| 60| 70|45.0|55.0|
+---+---+-----+-----+-----------+--------+---+---+----+----+
'''
