I have the following dataframe with the two first row looking like:
['station_id', 'country', 'temperature', 'time']
['12', 'usa', '22', '12:04:14']
I want to display the average temperature by descending order of the first 100 stations in 'france'.
What is the best way (Most efficient) to do it in pyspark?
We translate your query to Spark SQL
in the following way:
from pyspark.sql.functions import mean, desc
df.filter(df["country"] == "france") \ # only french stations
.groupBy("station_id") \ # by station
.agg(mean("temperature").alias("average_temp")) \ # calculate average
.orderBy(desc("average_temp")) \ # order by average
.take(100) # return first 100 rows
Using the RDD
API and anonymous functions:
df.rdd \
.filter(lambda x: x[1] == "france") \ # only french stations
.map(lambda x: (x[0], x[2])) \ # select station & temp
.mapValues(lambda x: (x, 1)) \ # generate count
.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1])) \ # calculate sum & count
.mapValues(lambda x: x[0]/x[1]) \ # calculate average
.sortBy(lambda x: x[1], ascending = False) \ # sort
.take(100)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.