
Pandas to pyspark cumprod function

I am trying to convert the pandas code below to PySpark.

Python pandas code:

df = spark.createDataFrame([(1, 1,0.9), (1, 2,0.13), (1, 3,0.5), (1, 4,1.0), (1, 5,0.6)], ['col1', 'col2','col3'])
pandas_df = df.toPandas()

pandas_df['col4'] = (pandas_df.groupby(['col1','col2'])['col3'].apply(lambda x: (1 - x).cumprod()))
pandas_df

The result is below:

   col1  col2  col3  col4
0     1     1  0.90  0.10
1     1     2  0.13  0.87
2     1     3  0.50  0.50
3     1     4  1.00  0.00
4     1     5  0.60  0.40

My converted Spark code:

from pyspark.sql import functions as F, Window, types
from functools import reduce
from operator import mul

df = spark.createDataFrame([(1, 1,0.9), (1, 2,0.13), (1, 3,0.5), (1, 4,1.0), (1, 5,0.6)], ['col1', 'col2','col3'])
partition_column = ['col1','col2']
window = Window.partitionBy(partition_column)
expr = 1.0 - F.col('col3')
mul_udf = F.udf(lambda x: reduce(mul, x), types.DoubleType())
df = df.withColumn('col4', mul_udf(F.collect_list(expr).over(window)))
df.orderBy('col2').show()

and its output:

+----+----+----+-------------------+
|col1|col2|col3|               col4|
+----+----+----+-------------------+
|   1|   1| 0.9|0.09999999999999998|
|   1|   2|0.13|               0.87|
|   1|   3| 0.5|                0.5|
|   1|   4| 1.0|                0.0|
|   1|   5| 0.6|                0.4|
+----+----+----+-------------------+

I don't completely understand how pandas works. Can someone help me validate whether the above conversion is correct? Also, I am using a UDF, which will reduce performance. Is there any distributed function available in PySpark that will do cumprod()?

Thanks in advance.

Since a product of positive numbers can be expressed with the log and exp functions (a*b*c = exp(log(a) + log(b) + log(c))), you can calculate the cumulative product using only Spark built-in functions:

from pyspark.sql.functions import coalesce, col, exp, lit, log, max, sum

df.groupBy("col1", "col2") \
  .agg(max(col("col3")).alias("col3"),
       # exp(sum(log(x))) equals the product of the x values;
       # coalesce maps the null produced by log(0) (when col3 == 1.0) to 0
       coalesce(exp(sum(log(lit(1) - col("col3")))), lit(0)).alias("col4")
  )\
  .orderBy(col("col2"))\
  .show()
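
If the intent is a true running product within each col1 group ordered by col2 (rather than one value per (col1, col2) pair, as in the grouping above), the same log/exp trick can be applied over a window. A minimal sketch, assuming every 1 - col3 factor is strictly positive: log returns null for non-positive inputs and a windowed sum skips nulls, so zero factors would need separate handling.

from pyspark.sql import functions as F, Window

# Cumulative product of (1 - col3) within each col1 group, ordered by col2:
# exp(running sum of log(1 - col3)) == running product of (1 - col3)
w = Window.partitionBy('col1').orderBy('col2')
df_cum = df.withColumn('col4', F.exp(F.sum(F.log(1.0 - F.col('col3'))).over(w)))
df_cum.orderBy('col2').show()

Because the window is ordered, F.sum(...).over(w) accumulates row by row, which is the Spark analogue of pandas cumprod without any UDF.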
