
Pyspark: how to add a column with the row number?

I have a pyspark dataframe. I would like to add a column that contains the row number.

This is what I am doing:

stop_df = stop_df.withColumn("stop_id", monotonically_increasing_id())

If I check the maximum value of stop_id, I get

stop_df.agg(max("stop_id")).show()
+--------------+
|  max(stop_id)|
+--------------+
|32478542692458|
+--------------+

while the number of rows is

stop_df.count()
Out[4]: 8134605

From the Spark monotonically_increasing_id docs:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
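To see why the maximum ID dwarfs the row count, you can decode it with the bit layout the docs describe. This is a plain-Python sketch of that layout, not Spark code; the number comes from the question's output:

```python
# Decode an ID produced by monotonically_increasing_id():
# upper bits hold the partition ID, lower 33 bits the record number.
max_id = 32478542692458                   # observed max(stop_id) above

partition_id = max_id >> 33               # partition the row lived in
record_number = max_id & ((1 << 33) - 1)  # row's index within that partition

print(partition_id, record_number)        # → 3781 106
```

So the largest ID simply belongs to record 106 of partition 3781; the IDs jump by 2**33 at every partition boundary, which is why they bear no relation to the total row count.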

Use the window function row_number to get the row number.

df = spark.createDataFrame([("a",), ("b",)], ["id"])
from pyspark.sql.window import Window
from pyspark.sql.functions import lit, row_number, max

# add partitionBy/orderBy clauses if an ordering is required within the window
w = Window.orderBy(lit(1))

df.withColumn("stop_id",row_number().over(w)).show()
#+---+-------+
#| id|stop_id|
#+---+-------+
#|  a|      1|
#|  b|      2|
#+---+-------+

df.withColumn("stop_id",row_number().over(w)).agg(max("stop_id")).show()
#+------------+
#|max(stop_id)|
#+------------+
#|           2|
#+------------+

df.count()
#2
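Note that Window.orderBy(lit(1)) has no partitionBy, so Spark funnels every row through a single partition to number them, which can be slow on a large dataframe. A common alternative is df.rdd.zipWithIndex(), which assigns consecutive 0-based indices in partition order without a global sort. A plain-Python sketch of what zipWithIndex computes (illustration only, not Spark code):

```python
# Plain-Python analogue of RDD.zipWithIndex(): indices are assigned
# consecutively in partition order, continuing across partition boundaries.
def zip_with_index(partitions):
    i = 0
    for part in partitions:
        for record in part:
            yield record, i
            i += 1

partitions = [["a", "b"], ["c"]]          # two partitions
print(list(zip_with_index(partitions)))   # → [('a', 0), ('b', 1), ('c', 2)]
```

In Spark the same idea would look roughly like `df.rdd.zipWithIndex()` followed by mapping each `(row, index)` pair back into a dataframe; the indices are consecutive, at the cost of one extra job to count the records per partition.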
