
How to get first value and last value from dataframe column in pyspark?

I have a DataFrame, and I want to get the first and last values from one of its columns.

+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
|   A|    5| 0.23809523809523808|
|   B|    5| 0.23809523809523808|
|   C|    4| 0.19047619047619047|
|   G|    2| 0.09523809523809523|
|   K|    2| 0.09523809523809523|
|   D|    1|0.047619047619047616|
+----+-----+--------------------+

The expected output is the first and last values of the support column, i.e. x=[0.23809523809523808, 0.047619047619047616].

You may use collect, but the performance is going to be terrible, since the driver will collect all the data just to keep the first and last items. Worse than that, it will most likely cause an OOM error, and thus not work at all, if you have a big dataframe.
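For completeness, this is roughly what that collect approach would look like (a sketch only, to show why it is wasteful: every row is shipped to the driver just to keep two values):

# Sketch: collects the entire dataframe to the driver
values = [row.support for row in df.collect()]
x = [values[0], values[-1]]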

Another idea would be to use agg with the first and last aggregation functions. This does not work! (because the reducers do not necessarily get the records in the order of the dataframe)
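For reference, this is what that agg attempt would look like (a sketch only; as noted above, first and last give no ordering guarantee in a plain aggregation, so the values returned are not reliable):

import pyspark.sql.functions as F

# Sketch of the agg idea; the row order is NOT guaranteed, so do not rely on this
row = df.agg(F.first('support'), F.last('support')).collect()[0]
x = [row[0], row[1]]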

Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe backward and use the head function again.

import pyspark.sql.functions as F

first = df.head().support
# Sort backward by a monotonically increasing id, then take the new head
last = df.orderBy(F.monotonically_increasing_id().desc()).head().support

Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and keep only the first and the last elements.

size = df.count()
# Index every row, keep only the first (index 0) and the last (index size-1),
# then collect just their support values
df.rdd.zipWithIndex()\
  .filter(lambda x: x[1] == 0 or x[1] == size - 1)\
  .map(lambda x: x[0].support)\
  .collect()

You can try indexing the data frame; see the example below:

df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]

EDIT: You have to pass the column name as well.

df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']

Since version 3.0.0, Spark also has a DataFrame function called .tail() to get the last value.

This will return a list of Row objects:

last=df.tail(1)[0].support
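Putting it together, a minimal sketch for Spark >= 3.0.0 that gets both values without sorting or collecting the whole dataframe:

first = df.head().support      # first Row's support value
last = df.tail(1)[0].support   # tail(1) returns a list holding the last Row
x = [first, last]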
