How to perform a self join with the same row of the previous group (month) to bring in additional columns with different expressions in PySpark
I get errors when calculating the new columns value1_1 through value4_4 based on the formulas given below.
Input:
Month_no|value1 |value2 |value3 |value4|
01 |10 |20 |30 |40 |
01 |20 |30 |40 |50 |
01 |30 |40 |50 |60 |
02 |40 |50 |60 |70 |
02 |50 |60 |70 |80 |
02 |60 |70 |80 |90 |
03 |70 |80 |90 |100 |
03 |80 |90 |100 |110 |
03 |90 |100 |110 |120 |
value1_1 and value2_2 should be calculated as: value1 + previous month's value1. For example, for Month_no 02, the value1_1 of the first row should be Month_no 01's first-row value1 (10) + Month_no 02's first-row value1 (40) = 50.
value3_3 and value4_4 should be calculated as: (value3 + previous month's value3) / (qtr month no#).
qtr month no#: the month's number within its quarter.
If Jan no# is 1
If Feb no# is 2
If Mar no# is 3
If Apr no# is 1
If May no# is 2
If Jun no# is 3
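The quarter-position mapping above can be sketched in plain Python (the function name `qtr_month_no` is illustrative, not from the original post):

```python
def qtr_month_no(month_no: int) -> int:
    """Return the 1-based position of a month within its quarter (1, 2 or 3)."""
    return 3 if month_no % 3 == 0 else month_no % 3

# Jan..Jun map to 1, 2, 3, 1, 2, 3
print([qtr_month_no(m) for m in range(1, 7)])  # [1, 2, 3, 1, 2, 3]
```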
Output: value1_1 and value2_2 are calculated with the first formula; value3_3 and value4_4 with the second.
Month_no|value1 |value2 |value3 |value4 |value1_1|value2_2|value3_3 |value4_4 |
01 |10 |20 |30 |40 |10 |20 |30 |40 |
01 |20 |30 |40 |50 |20 |30 |40 |50 |
01 |30 |40 |50 |60 |30 |40 |50 |60 |
02 |40 |50 |60 |70 |50 |70 |45 |55 |
02 |50 |60 |70 |80 |70 |90 |55 |65 |
02 |60 |70 |80 |90 |90 |110 |65 |75 |
03 |70 |80 |90 |100 |120 |150 |45 |51.66666667|
03 |80 |90 |100 |110 |150 |180 |51.66666667|58.33333333|
03 |90 |100 |110 |120 |180 |210 |58.33333333|65 |
I was trying a for loop over the months, joining each month with the previous one and calculating the new values. But the for loop becomes a performance issue for millions of records. Any suggestion for another approach?
Your question is unclear; however, based on the data, I will try to answer it.
Based on your source data, within each month the rows look like they are sorted by something. I will take value1 as the sorting column; you can change it to something else based on your logic. Using this sorting column, I will generate a row_number and use it in a self join.
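The row_number pairing idea can be illustrated in plain Python before looking at the Spark code: within each month the rows are numbered by the sort key, and row i of month m is matched with row i of month m-1 (the sample data and names here are illustrative, using only the value1 formula):

```python
from collections import defaultdict

# (month_no, value1) rows, pre-sorted by value1 within each month
rows = [(1, 10), (1, 20), (2, 40), (2, 50)]

by_month = defaultdict(list)
for month, v in rows:
    by_month[month].append(v)          # list index plays the role of rownum - 1

pairs = []
for month, values in by_month.items():
    prev = by_month.get(month - 1, []) # previous month's rows, if any
    for i, v in enumerate(values):
        prev_v = prev[i] if i < len(prev) else 0   # analogue of fillna(0)
        pairs.append((month, v, v + prev_v))       # value1_1 = value1 + prev month's value1

print(pairs)  # [(1, 10, 10), (1, 20, 20), (2, 40, 50), (2, 50, 70)]
```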
You can try something like the code below to achieve your results. It gives the proper results in Spark 2.x; you may have to tweak it to work in your Spark environment. Please note that your formula and your result set do not match for Month_no 03.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col, when

# store your source data and form it as a list of lists
data = """ 01 |10 |20 |30 |40
01 |20 |30 |40 |50
01 |30 |40 |50 |60
02 |40 |50 |60 |70
02 |50 |60 |70 |80
02 |60 |70 |80 |90
03 |70 |80 |90 |100
03 |80 |90 |100 |110
03 |90 |100 |110 |120 """
data01 = data.split('\n')
data02 = [item.split('|') for item in data01]

# column-name variables for convenience
month_no = 'Month_no'; value1 = 'value1'; value2 = 'value2'
value3 = 'value3'; value4 = 'value4'
crownum = 'rownum'; qtrMonthNo = 'qtrMonthNo'

# create an rdd & df from the data (sc is the SparkContext of the pyspark shell)
df = sc.parallelize(data02).toDF(['Month_no', 'value1', 'value2', 'value3', 'value4'])
sourcedata = df.selectExpr(
    "cast(trim(Month_no) as integer) as Month_no",
    "cast(trim(value1) as integer) as value1",
    "cast(trim(value2) as integer) as value2",
    "cast(trim(value3) as integer) as value3",
    "cast(trim(value4) as integer) as value4")

# add rownum so each row can be joined with the matching row of the previous month
rownum_window = Window.partitionBy(month_no).orderBy(value1)
df1 = sourcedata.withColumn("rownum", row_number().over(rownum_window))

# prepare dataframes for the self join; prefix the right side's columns with "r_"
df_left = df1
df_right = df1.select(*[col(colm).alias("r_" + colm) for colm in df1.columns])

# left join each row with the same rownum of the previous month;
# fillna(0) covers the first month, which has no previous month
df_joined = df_left.join(
    df_right,
    (df_left.Month_no - 1 == df_right.r_Month_no) & (df_left.rownum == df_right.r_rownum),
    "left").fillna(0)

# month's position within its quarter: 1, 2 or 3
df_joined = df_joined.withColumn(qtrMonthNo, when(df_joined.Month_no % 3 == 0, 3).otherwise(df_joined.Month_no % 3))

# optional: cache, since df_joined feeds the two passes below
df_joined.cache()

# calculating value1_1 & value2_2
first_cal = df_joined.select(
    (col("r_value1") + col("value1")).alias("value1_1"),
    (col("r_value2") + col("value2")).alias("value2_2"),
    qtrMonthNo, "r_value3", "r_value4", *df1.columns)

# calculating value3_3 & value4_4
second_cal = first_cal.select(
    ((col("r_value3") + col("value3")) / col(qtrMonthNo)).alias("value3_3"),
    ((col("r_value4") + col("value4")) / col(qtrMonthNo)).alias("value4_4"),
    *first_cal.columns)

# final dataframe with the necessary columns and sorted data
result_df = second_cal.orderBy(month_no, value1).drop(qtrMonthNo, crownum, "r_value3", "r_value4")
result_df.show()
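As a side note on the Month_no 03 mismatch: the expected output there (e.g. value1_1 = 120 = 10 + 40 + 70) looks like a running sum of value1 over all previous months at the same row position, not just the one previous month. If that is the intent, a cumulative window sum, e.g. `sum("value1").over(Window.partitionBy("rownum").orderBy("Month_no").rowsBetween(Window.unboundedPreceding, 0))`, avoids the self join entirely. A plain-Python illustration of that running-sum semantics (sample data from the question):

```python
# value1 per month, already in rownum order within each month
months = {1: [10, 20, 30], 2: [40, 50, 60], 3: [70, 80, 90]}

value1_1 = {}
running = [0, 0, 0]                    # one accumulator per row position
for m in sorted(months):
    running = [r + v for r, v in zip(running, months[m])]
    value1_1[m] = list(running)        # running sum up to and including month m

print(value1_1[3])  # [120, 150, 180], matching the expected output for Month_no 03
```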