How to perform self join with same row of previous group(month) to bring in additional columns with different expressions in Pyspark

I'm getting an error calculating the following new columns, value1_1 through value4_4, based on the formulas given below.

Input:

Month_no|value1 |value2 |value3 |value4|
01      |10     |20     |30     |40    |
01      |20     |30     |40     |50    |
01      |30     |40     |50     |60    |
02      |40     |50     |60     |70    |
02      |50     |60     |70     |80    |
02      |60     |70     |80     |90    |
03      |70     |80     |90     |100   |
03      |80     |90     |100    |110   |
03      |90     |100    |110    |120   |

value1_1 and value2_2 should be calculated as: value1 + the previous month's value1. For example, for month_no 02, the first row's value1_1 should be month_no 01's first-row value1 (10) + month_no 02's first-row value1 (40) = 50.

value3_3 and value4_4 should be calculated as: (value3 + the previous month's value3) / (qtr month no#).

qtr month no#: the month's position within its quarter (see the sketch after this list).

If Jan no# is 1
If Feb no# is 2
If Mar no# is 3
If Apr no# is 1
If May no# is 2
If Jun no# is 3
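
A minimal way to derive this number, assuming Month_no is a calendar month in 1-12:

# hedged sketch: quarter-month number from a 1-12 calendar month number
# ((m - 1) % 3) + 1 maps 1,2,3 -> 1,2,3 and 4,5,6 -> 1,2,3, and so on
def qtr_month_no(m):
    return ((m - 1) % 3) + 1

assert [qtr_month_no(m) for m in range(1, 7)] == [1, 2, 3, 1, 2, 3]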

Output: value1_1 and value2_2 are calculated with one formula, and value3_3 and value4_4 with the other.

Month_no|value1 |value2 |value3 |value4 |value1_1|value2_2|value3_3   |value4_4   |
01      |10     |20     |30     |40     |10      |20      |30         |40         |
01      |20     |30     |40     |50     |20      |30      |40         |50         |
01      |30     |40     |50     |60     |30      |40      |50         |60         |
02      |40     |50     |60     |70     |50      |70      |45         |55         |
02      |50     |60     |70     |80     |70      |90      |55         |65         |
02      |60     |70     |80     |90     |90      |110     |65         |75         |
03      |70     |80     |90     |100    |120     |150     |45         |51.66666667|
03      |80     |90     |100    |110    |150     |180     |51.66666667|58.33333333|
03      |90     |100    |110    |120    |180     |210     |58.33333333|65         |
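
As a worked check against the table above, the month_no 02 first row gives value1_1 = 10 + 40 = 50 and value3_3 = (30 + 60) / 2 = 45, since 02 is the second month of its quarter.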

I tried a for loop over the months, joining each month with the previous one and calculating the new values. But the for loop runs into performance issues with millions of records. Any suggestions for another approach?

Your question is unclear. However, based on the data, I will try to answer it.

Based on your source data, within each month the data looks like it is sorted by something. I will take value1 as the sorting column; you can change it to something else based on your logic. Using this sorting column, I will generate a row_number and use it in the self join.

You can try something like the code below to achieve your results. It gives the expected results in Spark 2.x; you may have to tweak it for your Spark environment. Please note that your formula and your result set do not match for Month_no 3: the expected output there looks like a running total across all months (e.g. 120 = 10 + 40 + 70), not value1 plus only the previous month's value1 (which would give 110).

from pyspark.sql import Window
from pyspark.sql.functions import row_number,col,when

#storing your source data and forming it as a list of list
data=""" 01    |10     |20     |30     |40    
  01    |20     |30     |40     |50     
  01    |30     |40     |50     |60     
  02    |40     |50     |60     |70     
  02    |50     |60     |70     |80     
  02    |60     |70     |80     |90     
  03    |70     |80     |90     |100    
  03    |80     |90     |100    |110    
  03    |90     |100    |110    |120    """

data01=data.split('\n')
data02=[ item.split('|') for item in data01 ]

#creating variables with column names for convenience
month_no='Month_no';value1='value1';value2='value2';value3='value3';value4='value4';crownum="rownum";qtrMonthNo="qtrMonthNo";

#creating rdd & df based on your data
df=sc.parallelize(data02).toDF(['Month_no','value1','value2','value3','value4'])
sourcedata=df.selectExpr("cast(trim(month_no) as integer) as Month_no","cast(trim(value1) as integer) as value1","cast(trim(value2) as integer) as value2","cast(trim(value3) as integer) as value3","cast(trim(value4) as integer) as value4")

#Adding rownum to join with appropriate row in same month
rownum_window=Window.partitionBy(month_no).orderBy(value1)
df1=sourcedata.withColumn("rownum",row_number().over(rownum_window))

#preparing dataframes for join
df_left=df1 
df_right=df1.select(*[col(colm).alias("r_"+colm)  for colm in df1.columns ])

#left-joining each row to the same-position row of the previous month;
#month 01 has no previous month, so its r_* columns are null and get zero-filled
df_joined=df_left.join(df_right,( df_left.Month_no - 1 == df_right.r_Month_no )  & ( df_left.rownum==df_right.r_rownum )  ,"left").fillna(0)
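#Month_no % 3 gives 1,2,0 for the three months of a quarter; when() maps the 0 case back to 3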
df_joined=df_joined.withColumn(qtrMonthNo,when(df_joined.Month_no % 3 == 0, 3).otherwise(df_joined.Month_no % 3))
#caching is optional; it just avoids recomputing the join for the two select steps below
df_joined.cache()

#calculating value1_1 & value2_2 (r_value3 and r_value4 are carried along for the next step)
first_cal=df_joined.select((col("r_value1")+col("value1")).alias("value1_1"),(col("r_value2")+col("value2")).alias("value2_2"),qtrMonthNo,"r_value3","r_value4",*df1.columns)

#calculating value3_3 & value4_4
second_cal=first_cal.select(((col("r_value3")+col("value3")) / col("qtrMonthNo") ).alias("value3_3"),((col("r_value4")+col("value4")) / col("qtrMonthNo") ).alias("value4_4"),*first_cal.columns)

#final dataframe with necessary columns and sorted data
result_df=second_cal.orderBy(month_no,value1).drop(qtrMonthNo,crownum,"r_value3","r_value4")
result_df.show()
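
If what you actually want for value1_1 and value2_2 is the running total across months that your Month_no 3 output implies, a cumulative-sum window avoids the self join entirely. A minimal sketch, reusing sourcedata from above and still assuming value1 defines the row order within each month:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, sum as sum_

#position of each row within its month
pos_window = Window.partitionBy("Month_no").orderBy("value1")

#running total across months for rows in the same position
cum_window = (Window.partitionBy("rownum").orderBy("Month_no")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow))

alt = (sourcedata
       .withColumn("rownum", row_number().over(pos_window))
       .withColumn("value1_1", sum_("value1").over(cum_window))
       .withColumn("value2_2", sum_("value2").over(cum_window)))

This sketch only covers value1_1 and value2_2; value3_3 and value4_4, as you stated them, use the previous month's raw value3/value4 divided by the quarter-month number, which the join-based code above already handles.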

