I already have an RDD as the result of a calculation.
Let's say it has the following format:
(uid, factor, name, avatar, gender, otherFactor1, otherFactor2)
Now I want to sort the RDD by factor
and add a field, rank,
that holds each record's position in the sorted order, and later use a foreach to write every record to the DB.
I know I could start like this:
rdd.sortBy {
  case (uid, factor, name, avatar, gender, otherFactor1, otherFactor2) => factor
}.foreach {
  // how could I insert a rank field by the index of the loop?
}
What I am stuck on is how to add the rank
field from the foreach loop's index.
Any ideas?
As mentioned in the comments, you can use
rdd.sortBy(_._2).zipWithIndex
You can then flatten the result into a more convenient structure:
rdd.sortBy(_._2).zipWithIndex.map {
  case ((uid, factor, name, avatar, gender, otherFactor1, otherFactor2), rank) =>
    (uid, factor, name, avatar, gender, otherFactor1, otherFactor2, rank)
}
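To see the shape of the result without a cluster, here is a minimal sketch on a plain Scala Seq, which happens to offer sortBy and zipWithIndex with the same feel as the RDD methods. The three-field records and their values are made up for illustration. Note that zipWithIndex starts counting at 0 (and on an RDD the index is a Long rather than an Int), so the sketch adds 1 to get a 1-based rank and negates the factor so that the highest factor comes first.

```scala
// Local stand-in for the RDD: a Seq of (uid, factor, name) tuples.
// The sample values are invented for this sketch.
val records = Seq(
  ("u1", 0.4, "alice"),
  ("u2", 0.9, "bob"),
  ("u3", 0.7, "carol")
)

// Sort by factor, highest first, then attach the position as a 1-based rank.
val ranked = records
  .sortBy { case (_, factor, _) => -factor }
  .zipWithIndex
  .map { case ((uid, factor, name), index) => (uid, factor, name, index + 1) }
```

On a real RDD the descending order could instead be requested with sortBy(_._2, ascending = false), and the rank would be index + 1 as a Long.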
One thing to note about zipWithIndex, from the source code of RDD.scala:
This method needs to trigger a spark job when this RDD contains more than one partitions.
If you want to avoid that, you can use zipWithUniqueId,
but it does not give contiguous indices, so its ids are unique but cannot be used directly as ranks.
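The Spark documentation describes the ids from zipWithUniqueId as follows: items in the k-th partition get ids k, n+k, 2n+k, ..., where n is the number of partitions. The sketch below only simulates that documented scheme on plain Scala collections to show why gaps appear. The helper uniqueIds is hypothetical and is not part of Spark.

```scala
// Hypothetical helper: reproduces the id scheme documented for zipWithUniqueId
// on plain collections. This is a simulation, not Spark code.
def uniqueIds[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
  val n = partitions.length
  partitions.zipWithIndex.flatMap { case (items, k) =>
    // The i-th item of partition k gets id i * n + k.
    items.zipWithIndex.map { case (item, i) => (item, i.toLong * n + k) }
  }
}

// Two unevenly sized "partitions" of already sorted data.
val ids = uniqueIds(Seq(Seq("a", "b", "c"), Seq("d")))
```

With uneven partitions the ids come out as 0, 2, 4 for the first partition and 1 for the second, so id 3 is never assigned and the last element in sort order does not get the largest id.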
See if the example below helps.
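// toDF() needs the SQLContext implicits in scope.
import sqlContext.implicits._
case class ItemInfo(item: String, quantity: Int)
val data = sc.parallelize(List(("a", 10), ("b", 20), ("c", 30)))
val itemDF = data.map(x => ItemInfo(x._1, x._2)).toDF()
itemDF.registerTempTable("Item_tbl")
// rank() is a window function; on Spark 1.x this assumes sqlContext is a HiveContext.
val rankedItems = sqlContext.sql("select item, quantity, rank() over(order by quantity desc) as rank from Item_tbl")
rankedItems.collect().foreach(println)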
This example ranks the items by quantity, highest first.