
Order by value in Spark pair RDD after join

I have two pair RDDs that I joined on the same key, and now I want to sort the result by one of the values. The type of the joined RDD is: RDD[((String, Int), Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))])]

where the first part is the pair RDD key and the Iterable part holds the values from the two RDDs I joined. I now want to order them by the Time field of the second RDD. I tried to use the sortBy function, but I got errors.

Any ideas?

Thanks

Spark pair RDDs have a mapValues method. I think it will help you.

    def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]
    Pass each value in the key-value pair RDD through a map function
    without changing the keys; this also retains the original RDD's partitioning.

The Spark documentation has more details.

You're right that you can use the sortBy function:

val yourRdd: RDD[((String, Int), Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))])] = ...(your cogroup operation here)

val result = yourRdd.sortBy({
  // sort key: the DateTime field of the second tuple in the first joined pair
  case ((str, i), iter) if iter.nonEmpty => iter.head._2._2
  }, true)

iter.head has type ((String, DateTime, Int,Int), (String, DateTime, String, String));

iter.head._2 has type (String, DateTime, String, String); and

iter.head._2._2 indeed has type DateTime.

You may also need to provide an implicit Ordering object for DateTime. By the way, can the iterable be empty? If so, you should add that case to the function passed to sortBy, because the pattern above only matches non-empty iterables. And if the iterable contains many items, which one should be used for sorting?
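Both open questions (empty iterables, and which item to sort on) can be settled inside the key function. Below is a minimal sketch in Python, assuming each record has the shape (key, values), each value is a (left, right) pair, and right[1] is a naive datetime. The name sort_key and the policy "use the earliest timestamp, put empty groups last" are illustrative assumptions, not something the question specifies.

```python
from datetime import datetime

def sort_key(record):
    """Return the earliest right-hand timestamp of a record's group."""
    _key, values = record
    # Collect the timestamp (index 1) of the second tuple in every joined pair.
    times = [right[1] for _left, right in values]
    # Assumption: empty groups sort last, using datetime.max as a sentinel
    # (this only compares cleanly against naive, timezone-free datetimes).
    return min(times) if times else datetime.max
```

With PySpark this could be passed as unsortedRDD.sortBy(sort_key); on a plain list the same function works with sorted(records, key=sort_key).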

If the RDD's Iterable needs to be sorted:

val rdd: RDD[((String, Int), 
             Iterable[((String, DateTime, Int,Int), 
                       (String, DateTime, String, String))])] = ???

val dateOrdering = new Ordering[org.joda.time.DateTime]{ 
    // delegate to Joda's own comparison so that equal instants compare as 0
    override def compare(a: org.joda.time.DateTime,
                         b: org.joda.time.DateTime) = 
        a.compareTo(b)
}

rdd.mapValues(v => v.toArray
                    .sortBy(x => x._2._2)(dateOrdering))
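
For comparison, the same per-key sort can be sketched in Python. This assumes the record layout from the type shown above (each value is a (left, right) pair with the timestamp at right[1]); sort_group is a hypothetical helper name and the sample data is made up for illustration.

```python
from datetime import datetime

def sort_group(values):
    """Materialize one key's values and sort them by the right-hand timestamp."""
    return sorted(values, key=lambda pair: pair[1][1])

# Made-up sample group: two joined pairs with out-of-order right-hand timestamps.
group = [
    (("l1", datetime(2020, 1, 2), 1, 1), ("r1", datetime(2020, 1, 5), "x", "y")),
    (("l2", datetime(2020, 1, 3), 2, 2), ("r2", datetime(2020, 1, 4), "x", "y")),
]
ordered = sort_group(group)  # the "r2" pair (Jan 4) now precedes "r1" (Jan 5)
```

In PySpark this would be applied as rdd.mapValues(sort_group), which sorts inside each key without touching the keys themselves.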

Using Python:

sortedRDD = unsortedRDD.sortBy(lambda x:x[1][1], False)

This sorts in descending order.
