
Increment values of a column based on the current row of another column and the previous row of the same column
Pyspark: Generate a column based on another column that has repeatedly appended values up to the current row
Use the F.collect_list and explode functions. F.collect_list over a window cumulatively collects the arrays from column A up to the current row. explode followed by another collect_list then merges those arrays into a single flat list.
With your df:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.window import Window
from pyspark.sql import functions as F

schema = StructType([StructField("product", StringType()),
                     StructField("year", IntegerType()),
                     StructField("A", ArrayType(IntegerType()))])
data = [['A', 2010, [1, 2, 3]], ['A', 2011, [4, 5, 6]], ['A', 2012, [7, 8]],
        ['B', 2009, [10, 11, 12]], ['B', 2010, [16, 17]],
        ['B', 2011, [20, 21, 22, 23]], ['B', 2012, [24]]]
df = spark.createDataFrame(data, schema=schema)

# Cumulative window: for each product, all rows up to and including the current year
w = Window.partitionBy("product").orderBy("year")

(df.withColumn("first_list", F.collect_list("A").over(w))            # arrays collected so far
   .withColumn("first_explode", F.explode(F.col("first_list")))      # one array per row
   .withColumn("second_explode", F.explode(F.col("first_explode")))  # one element per row
   .withColumn("cum_list",
               F.collect_list("second_explode").over(Window.partitionBy("product", "year")))
   .drop("first_list", "first_explode", "second_explode").distinct()
   .orderBy("product", "year").show(truncate=False))
(Use F.collect_set on second_explode instead if you don't want duplicates.)
Output:
+-------+----+----------------+----------------------------------------+
|product|year|A |cum_list |
+-------+----+----------------+----------------------------------------+
|A |2010|[1, 2, 3] |[1, 2, 3] |
|A |2011|[4, 5, 6] |[1, 2, 3, 4, 5, 6] |
|A |2012|[7, 8] |[1, 2, 3, 4, 5, 6, 7, 8] |
|B |2009|[10, 11, 12] |[10, 11, 12] |
|B |2010|[16, 17] |[10, 11, 12, 16, 17] |
|B |2011|[20, 21, 22, 23]|[10, 11, 12, 16, 17, 20, 21, 22, 23] |
|B |2012|[24] |[10, 11, 12, 16, 17, 20, 21, 22, 23, 24]|
+-------+----+----------------+----------------------------------------+