How can I enumerate rows in groups with Spark/Python?
I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is a way to do this in Spark/Python?
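For reference, a minimal sketch of what I mean on the Pandas side (the column names "group" and "value" are made up):

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b", "b"], "value": [10, 20, 5, 15, 25]})
# 0-based enumeration of the rows within each group
df["rn"] = df.groupby("group").cumcount()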
With the row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window

# partition by the grouping column and order the rows within each group
w = Window.partitionBy("some_column").orderBy("some_other_column")
# row_number() assigns 1, 2, 3, ... within each window partition
df.withColumn("rn", row_number().over(w))
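A minimal, self-contained sketch of how this looks end to end; the "group"/"value" columns and the printed output below are illustrative:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

# hypothetical example data: two groups with a few rows each
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 5), ("b", 15), ("b", 25)],
    ["group", "value"],
)

w = Window.partitionBy("group").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()
# +-----+-----+---+
# |group|value| rn|
# +-----+-----+---+
# |    a|   10|  1|
# |    a|   20|  2|
# |    b|    5|  1|
# |    b|   15|  2|
# |    b|   25|  3|
# +-----+-----+---+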
You can achieve this at the RDD level by doing:
# zipWithIndex pairs every element with its 0-based index across the whole RDD
rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+
If you only need a unique ID, not a truly contiguous index, you may also use zipWithUniqueId(), which is more efficient since it is done locally on each partition.
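A minimal sketch of the zipWithUniqueId() variant; the exact IDs depend on how the data is partitioned, so the values in the comment are only illustrative:

rdd = sc.parallelize(['a', 'b', 'c'], 2)  # 2 partitions, just as an example
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()
# Elements in partition k get IDs k, k + n, k + 2n, ... (n = number of partitions),
# so the IDs are unique but not necessarily consecutive, e.g. 0, 1, 3 here.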