
What is the most efficient way in PySpark to reduce a DataFrame?

I have the following DataFrame, with the first two rows looking like this:

['station_id', 'country', 'temperature', 'time']
['12', 'usa', '22', '12:04:14']
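
For reference, a DataFrame with this shape could be built like this (a minimal sketch; the column types are assumed, and temperature is kept as a string exactly as shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows only, matching the layout shown above
rows = [
    ("12", "usa", "22", "12:04:14"),
    ("35", "france", "18", "12:04:20"),   # hypothetical French station
]
df = spark.createDataFrame(rows, ["station_id", "country", "temperature", "time"])
df.show()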

I want to display the average temperature per station in 'france', sorted in descending order, limited to the first 100 stations.

What is the best (most efficient) way to do this in PySpark?

We can translate your query to the Spark SQL (DataFrame) API in the following way:

from pyspark.sql.functions import mean, desc

df.filter(df["country"] == "france") \ # only french stations
  .groupBy("station_id") \ # by station
  .agg(mean("temperature").alias("average_temp")) \ # calculate average
  .orderBy(desc("average_temp")) \ # order by average 
  .take(100) # return first 100 rows
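
If you prefer literal SQL, the same query can be expressed by registering the DataFrame as a temporary view (a sketch; the view name "stations" is just an example):

df.createOrReplaceTempView("stations")

spark.sql("""
    SELECT station_id, AVG(temperature) AS average_temp
    FROM stations
    WHERE country = 'france'
    GROUP BY station_id
    ORDER BY average_temp DESC
    LIMIT 100
""").show(100)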

Using the RDD API and anonymous functions:

(df.rdd
   .filter(lambda x: x[1] == "france")                     # keep only French stations
   .map(lambda x: (x[0], float(x[2])))                     # (station_id, temperature), cast temperature to float
   .mapValues(lambda x: (x, 1))                            # pair each temperature with a count of 1
   .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))   # sum temperatures and counts per station
   .mapValues(lambda x: x[0] / x[1])                       # sum / count = average
   .sortBy(lambda x: x[1], ascending=False)                # sort by average, descending
   .take(100))                                             # return the first 100 rows
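
Note that take(100) returns a plain Python list of (station_id, average) tuples on the driver. If you want to display it in tabular form, you can turn it back into a small DataFrame (a sketch; the values below are illustrative, and in practice top_100 would be the list returned by .take(100) above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: top_100 holds the list returned by .take(100) above; illustrative values here
top_100 = [("12", 21.5), ("48", 19.0)]
spark.createDataFrame(top_100, ["station_id", "average_temp"]).show(truncate=False)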
