Spark dataframe pivot median per quarter
I am trying to pivot a Spark dataframe to calculate the median per quarter, and then add extra columns holding the difference between consecutive quarters.
Sample data:
from pyspark.sql.functions import col, to_date

schema = "id INT, amount INT, timestmp STRING"
data = ((1,5000,"06.01.2020 00:39"),
        (1,2340,"26.02.2020 12:35"),
        (1,491,"01.03.2020 04:55"),
        (1,7801,"09.04.2020 14:51"),
        (1,2900,"19.05.2020 00:51"),
        (1,1200,"29.06.2020 10:01"),
        (1,890,"03.07.2020 12:31"),
        (1,3201,"09.08.2020 01:07"),
        (1,4449,"13.09.2020 17:01"),
        (2,3945,"09.01.2020 00:39"),
        (2,1846,"29.02.2020 12:35"),
        (2,387,"04.03.2020 04:55"),
        (2,6155,"12.04.2020 14:51"),
        (2,3542,"22.05.2020 00:51"),
        (2,947,"02.06.2020 10:01"),
        (2,702,"06.07.2020 12:31"),
        (2,1886,"12.08.2020 01:07"),
        (2,3510,"16.09.2020 17:01"))
dfraw = spark.createDataFrame(data, schema)
# The pattern must cover the time part as well; with just 'dd.MM.yyyy',
# Spark 3's stricter parser cannot handle strings like "06.01.2020 00:39".
df = dfraw.withColumn("purch_date", to_date(col("timestmp"), 'dd.MM.yyyy HH:mm')).drop("timestmp")
root
|-- id: integer (nullable = true)
|-- amount: integer (nullable = true)
|-- purch_date: date (nullable = true)
+---+------+----------+
| id|amount|purch_date|
+---+------+----------+
| 1| 5000|2020-01-06|
| 1| 2340|2020-02-26|
| 1| 491|2020-03-01|
| 1| 7801|2020-04-09|
| 1| 2900|2020-05-19|
| 1| 1200|2020-06-29|
| 1| 890|2020-07-03|
| 1| 3201|2020-08-09|
| 1| 4449|2020-09-13|
| 2| 3945|2020-01-09|
| 2| 1846|2020-02-29|
| 2| 387|2020-03-04|
| 2| 6155|2020-04-12|
| 2| 3542|2020-05-22|
| 2| 947|2020-06-02|
| 2| 702|2020-07-06|
| 2| 1886|2020-08-12|
| 2| 3510|2020-09-16|
+---+------+----------+
The result should look something like this (the column order may differ):
+---+--------------+-------+-------+--------------+-------+-------+--------------+
| id|median_2020-q1|q2-q1_s|q2-q1_p|median_2020-q2|q3-q2_s|q3-q2_p|median_2020-q3|
+---+--------------+-------+-------+--------------+-------+-------+--------------+
| 1| 2340| 560| 23.9| 2900| 301| 10.4| 3201|
| 2| 1846| 1696| 91.9| 3542| -1656| -46.8| 1886|
+---+--------------+-------+-------+--------------+-------+-------+--------------+
Columns 2, 5 and 8 = the median of "amount" per quarter
Columns 3 and 6 = the difference between the medians of the two quarters
Columns 4 and 7 = the percentage difference between the two quarters (e.g. q2-q1_p = q2-q1_s / median_2020-q1 * 100)
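For example, for id 1: q2-q1_s = 2900 - 2340 = 560, and q2-q1_p = 560 / 2340 * 100 ≈ 23.9.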
Any helpful suggestions on how to do this are greatly appreciated.
To calculate the median you can use the percentile_approx function. Here is a way to solve your problem:
from pyspark.sql import functions as F

schema = "id INT, amount INT, timestmp STRING"
data = ((1,5000,"06.01.2020 00:39"),
        (1,2340,"26.02.2020 12:35"),
        (1,491,"01.03.2020 04:55"),
        (1,7801,"09.04.2020 14:51"),
        (1,2900,"19.05.2020 00:51"),
        (1,1200,"29.06.2020 10:01"),
        (1,890,"03.07.2020 12:31"),
        (1,3201,"09.08.2020 01:07"),
        (1,4449,"13.09.2020 17:01"),
        (2,3945,"09.01.2020 00:39"),
        (2,1846,"29.02.2020 12:35"),
        (2,387,"04.03.2020 04:55"),
        (2,6155,"12.04.2020 14:51"),
        (2,3542,"22.05.2020 00:51"),
        (2,947,"02.06.2020 10:01"),
        (2,702,"06.07.2020 12:31"),
        (2,1886,"12.08.2020 01:07"),
        (2,3510,"16.09.2020 17:01"))
dfraw = spark.createDataFrame(data, schema)

# Note the full pattern including the time part, 'dd.MM.yyyy HH:mm'.
df = dfraw.withColumn("purch_date", F.to_date(F.col("timestmp"), 'dd.MM.yyyy HH:mm')).drop("timestmp")

# Derive the quarter (1-4) and cast it to string so the pivoted columns
# get simple string names ('1', '2', '3').
df = df.withColumn('quarters', F.quarter(df.purch_date).cast('string')).drop("purch_date")

# Pivot on the quarter and aggregate with the (approximate) median;
# a high accuracy value makes percentile_approx effectively exact here.
output = df.groupBy('id').pivot("quarters").agg(
    F.percentile_approx("amount", 0.5, F.lit(1000000)))

# Absolute (_s) and percentage (_p) differences between consecutive quarters.
output = (output
          .withColumn('q2-q1_s', F.col('2') - F.col('1'))
          .withColumn('q2-q1_p', F.round((F.col('2') - F.col('1')) / F.col('1') * 100, 1))
          .withColumn('q3-q2_s', F.col('3') - F.col('2'))
          .withColumn('q3-q2_p', F.round((F.col('3') - F.col('2')) / F.col('2') * 100, 1)))
output.show()
Here is what the output looks like:
+---+----+----+----+-------+-------+-------+-------+
| id| 1| 2| 3|q2-q1_s|q2-q1_p|q3-q2_s|q3-q2_p|
+---+----+----+----+-------+-------+-------+-------+
| 1|2340|2900|3201| 560| 23.9| 301| 10.4|
| 2|1846|3542|1886| 1696| 91.9| -1656| -46.8|
+---+----+----+----+-------+-------+-------+-------+
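The pivoted median columns above are simply named 1, 2 and 3. If you want names like median_2020-q1 from your expected output, one option is to build the label before pivoting. A minimal sketch, assuming you start from the frame that still has purch_date (i.e. before the drop above); q_label is just an illustrative column name:

labeled = df.withColumn(
    'q_label',
    F.concat(F.lit('median_'),
             F.year('purch_date').cast('string'),
             F.lit('-q'),
             F.quarter('purch_date').cast('string')))
named = labeled.groupBy('id').pivot('q_label').agg(
    F.percentile_approx('amount', 0.5, 1000000))
named.show()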
Note: you may want to look at answers about how to create multiple columns dynamically; a sketch of that idea follows.
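A minimal sketch of the dynamic approach, assuming the pivoted columns are named '1' to 'n' as above; add_quarter_diffs is a hypothetical helper, not part of any library:

# Hypothetical helper: add the _s and _p difference columns for every
# pair of consecutive quarter columns instead of spelling them out.
def add_quarter_diffs(pivoted, quarters):
    out = pivoted
    for prev, curr in zip(quarters, quarters[1:]):
        diff = F.col(curr) - F.col(prev)
        out = (out
               .withColumn(f'q{curr}-q{prev}_s', diff)
               .withColumn(f'q{curr}-q{prev}_p',
                           F.round(diff / F.col(prev) * 100, 1)))
    return out

pivoted = df.groupBy('id').pivot('quarters').agg(
    F.percentile_approx('amount', 0.5, 1000000))
output = add_quarter_diffs(pivoted, ['1', '2', '3'])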