
Spark: generate a rank on a specific field of an RDD

I already have an RDD as the result of a calculation. Let's say it has the following format:

(uid, factor, name, avatar, gender, otherFactor1, otherFactor2)

Now I want to sort the RDD by factor and add a field, rank, that holds each record's position in that order. Later on I will use a foreach to write every record to the DB.

I know I might do this by:

rdd.sortBy{
   case (uid, factor, name, avatar, gender, otherFactor1, otherFactor2) => {
       factor
   }
}.foreach{
   //how could I insert a rank field by the index of the loop?
}

This is where I am stuck: how do I add the rank field using the index of the foreach loop?

Any ideas?

As mentioned in the comments, you can use

rdd.sortBy(_._2).zipWithIndex

You can then flatten the result into a more convenient structure:

rdd.sortBy(_._2).zipWithIndex.map { 
    case ((uid, factor, name, avatar, gender, otherFactor1, otherFactor2), rank) =>
    (uid, factor, name, avatar, gender, otherFactor1, otherFactor2, rank)
}
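
As a rough sketch of how this might look end to end, the snippet below sorts by factor in descending order and turns the zero-based index into a one-based rank. The local SparkSession, the shortened three-field record (uid, factor, name), the sample values and the helper name withRank are all assumptions made for illustration; they are not part of the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RankWithZipWithIndex {
  // Hypothetical helper: appends a 1-based rank, highest factor first.
  def withRank(rdd: RDD[(Long, Double, String)]): RDD[(Long, Double, String, Long)] =
    rdd
      .sortBy(_._2, ascending = false) // highest factor first
      .zipWithIndex()                  // index is a Long starting at 0
      .map { case ((uid, factor, name), idx) => (uid, factor, name, idx + 1) }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own.
    val spark = SparkSession.builder().master("local[2]").appName("rank-sketch").getOrCreate()

    // Made-up records: (uid, factor, name)
    val rdd = spark.sparkContext.parallelize(
      Seq((1L, 0.3, "ann"), (2L, 0.9, "bob"), (3L, 0.5, "cat")))

    withRank(rdd).collect().foreach(println)
    spark.stop()
  }
}
```

Note that zipWithIndex assigns a distinct index to every element, so records with equal factor values still get different ranks.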

One thing to note about zipWithIndex, from the source code of RDD.scala:

This method needs to trigger a spark job when this RDD contains more than one partitions.

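If you want to avoid that, you can use zipWithUniqueId, but it does not give contiguous indices: elements in partition k get the ids k, n+k, 2n+k, ... (where n is the number of partitions), so the ids are unique but may have gaps and do not correspond to a rank.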
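
To see the difference between the two methods, a small experiment along these lines may help. It is only a sketch: the local SparkSession, the five sample values and the choice of two partitions are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object UniqueIdVsIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("unique-id-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Five made-up factors, already in ascending order, spread over 2 partitions
    val sorted = sc.parallelize(Seq(0.1, 0.2, 0.3, 0.4, 0.5), numSlices = 2)

    // zipWithIndex: contiguous indices 0..n-1 that follow the RDD's order
    sorted.zipWithIndex().collect().foreach(println)

    // zipWithUniqueId: ids are unique but assigned per partition,
    // so they can have gaps and need not follow the global order
    sorted.zipWithUniqueId().collect().foreach(println)

    spark.stop()
  }
}
```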

See if the example below helps. It uses the rank() window function from Spark SQL:

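import sqlContext.implicits._  // needed for toDF() outside the spark-shell

case class ItemInfo(item: String, quantity: Int)
val data = sc.parallelize(List(("a", 10), ("b", 20), ("c", 30)))
val itemDF = data.map(x => ItemInfo(x._1, x._2)).toDF()
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it there
itemDF.registerTempTable("Item_tbl")
val rankedItems = sqlContext.sql(
  "select item, quantity, rank() over (order by quantity desc) as rank from Item_tbl")
rankedItems.collect().foreach(println)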

This example ranks the items by quantity, in descending order.
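
The same ranking can probably also be expressed with the DataFrame window API instead of a SQL string. The following is a minimal sketch under some assumptions: it uses a Spark 2.x-style SparkSession rather than sqlContext, the helper name addRanks is hypothetical, and the extra row ("d", 30) is made up to show how ties behave.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, row_number}

object RankWithWindow {
  // Hypothetical helper: adds rank and row_number columns ordered by quantity, descending.
  def addRanks(items: DataFrame): DataFrame = {
    // No partitionBy: Spark moves all rows to a single partition for this window
    val byQuantity = Window.orderBy(col("quantity").desc)
    items
      .withColumn("rank", rank().over(byQuantity))             // ties share a rank, the next rank is skipped
      .withColumn("row_number", row_number().over(byQuantity)) // ties get distinct consecutive numbers
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("window-rank-sketch").getOrCreate()
    import spark.implicits._

    // ("d", 30) is an invented row that ties with "c"
    val items = Seq(("a", 10), ("b", 20), ("c", 30), ("d", 30)).toDF("item", "quantity")
    addRanks(items).show()
    spark.stop()
  }
}
```

rank() behaves differently from zipWithIndex when values tie: tied rows share a rank, whereas row_number() (like zipWithIndex) gives every row its own number. Because the window has no partitioning, this approach may not scale well to very large datasets.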
