
Sort by key (Month) using RDDs in Pyspark

I have this RDD and want to sort it by month (Jan --> Dec). How can I do it in PySpark? Note: I don't want to use spark.sql or DataFrames.

+-----+-----+
|Month|count|
+-----+-----+
|  Oct| 1176|
|  Sep| 1167|
|  Dec| 2084|
|  Aug| 1126|
|  May| 1176|
|  Jun| 1424|
|  Feb| 1286|
|  Nov| 1078|
|  Mar| 1740|
|  Jan| 1544|
|  Apr| 1080|
|  Jul| 1237|
+-----+-----+

You can use rdd.sortBy with a helper dictionary built from Python's calendar module, or create your own month dictionary:

import calendar

# Map month abbreviations to month numbers:
# {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
#  'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
d = {m: n for n, m in enumerate(calendar.month_abbr[1:], start=1)}

myrdd.sortBy(keyfunc=lambda x: d.get(x[0])).collect()

[('Jan', 1544),
 ('Feb', 1286),
 ('Mar', 1740),
 ('Apr', 1080),
 ('May', 1176),
 ('Jun', 1424),
 ('Jul', 1237),
 ('Aug', 1126),
 ('Sep', 1167),
 ('Oct', 1176),
 ('Nov', 1078),
 ('Dec', 2084)]
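The same key function works outside Spark, too. As a quick sanity check (pure Python, no RDD needed), `sorted` reorders the collected pairs with the same dictionary:

```python
import calendar

# Month abbreviation -> month number: {'Jan': 1, ..., 'Dec': 12}
d = {m: n for n, m in enumerate(calendar.month_abbr[1:], start=1)}

pairs = [('Oct', 1176), ('Sep', 1167), ('Dec', 2084), ('Aug', 1126),
         ('May', 1176), ('Jun', 1424), ('Feb', 1286), ('Nov', 1078),
         ('Mar', 1740), ('Jan', 1544), ('Apr', 1080), ('Jul', 1237)]

# Same key function as the sortBy call above
result = sorted(pairs, key=lambda x: d[x[0]])
print(result[:3])  # [('Jan', 1544), ('Feb', 1286), ('Mar', 1740)]
```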
Alternatively, collect the RDD and reorder the list on the driver (the hard-coded month list must include all twelve months):

myList = myrdd.collect()
my_list_dict = dict(myList)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
newList = []
for m in months:
    newList.append((m, my_list_dict[m]))
print(newList)
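If you'd rather not maintain a month list or dictionary at all, `datetime.strptime` can parse a month abbreviation directly into its month number; note that `%b` is locale-dependent, so this sketch assumes an English locale:

```python
from datetime import datetime

def month_key(pair):
    # Parse the ('Mon', count) pair's abbreviation into a number 1-12.
    # %b is locale-dependent; assumes an English locale.
    return datetime.strptime(pair[0], '%b').month

pairs = [('Oct', 1176), ('Jan', 1544), ('Dec', 2084)]
print(sorted(pairs, key=month_key))
# [('Jan', 1544), ('Oct', 1176), ('Dec', 2084)]
```

The same function can be passed to Spark as `myrdd.sortBy(keyfunc=month_key)`.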
