Pyspark: how to add a column with the row number?
I have a pyspark dataframe. I would like to add a column that contains the row number. This is what I am doing:
from pyspark.sql.functions import monotonically_increasing_id
stop_df = stop_df.withColumn("stop_id", monotonically_increasing_id())
If I check the maximum value of stop_id, I get
stop_df.agg(max("stop_id")).show()
+--------------+
| max(stop_id)|
+--------------+
|32478542692458|
+--------------+
while the number of rows is
stop_df.count()
Out[4]: 8134605
From the Spark monotonically_increasing_id docs:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
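Given that bit layout, the large max(stop_id) from the question can be decoded with plain bit arithmetic. A quick sanity check in plain Python (no Spark needed):

```python
# Decode a monotonically_increasing_id value: per the docs quoted above,
# the partition ID lives in the upper bits and the per-partition record
# number in the lower 33 bits.
max_id = 32478542692458                    # max(stop_id) observed above
partition_id = max_id >> 33                # shift off the lower 33 bits
record_number = max_id & ((1 << 33) - 1)   # keep only the lower 33 bits
print(partition_id, record_number)         # 3781 106
```

So the maximum ID simply comes from a row in partition 3781, which is why it is far larger than the row count: the values are unique and increasing, but nowhere near consecutive.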
Use the window row_number function to get the row number.
from pyspark.sql.window import Window
from pyspark.sql.functions import *

df = spark.createDataFrame([("a",), ("b",)], ["id"])

# add partitionBy and orderBy clauses if ordering is required within the window
w = Window.orderBy(lit(1))
df.withColumn("stop_id", row_number().over(w)).show()
#+---+-------+
#| id|stop_id|
#+---+-------+
#| a| 1|
#| b| 2|
#+---+-------+
df.withColumn("stop_id", row_number().over(w)).agg(max("stop_id")).show()
#+------------+
#|max(stop_id)|
#+------------+
#| 2|
#+------------+
df.count()
#2