How to get the first and last values from a dataframe column in PySpark?
I have a dataframe and I want to get the first and last values from one of its columns.
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| G| 2| 0.09523809523809523|
| K| 2| 0.09523809523809523|
| D| 1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last values of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].
You may use collect, but the performance is going to be terrible, since the driver will collect all the data just to keep the first and last items. Worse than that, on a big dataframe it will most likely cause an OOM error and thus not work at all.
Another idea would be to use agg with the first and last aggregation functions. This does not work! (The reducers do not necessarily receive the records in the order of the dataframe.)
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe in reverse order and use head again:
import pyspark.sql.functions as F

# head() returns the first Row; access the column by name
first = df.head().support
# reverse the implicit row order, then take the new head
last = df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and keep only the first and last elements:
size = df.count()
df.rdd.zipWithIndex()\
    .filter(lambda x: x[1] == 0 or x[1] == size - 1)\
    .map(lambda x: x[0].support)\
    .collect()
You can try indexing the dataframe; see the example below:
df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]
EDIT: You have to pass the column name as well.
df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']