Secondary sort in Spark using Clojure / Flambo

Question

I have a Scala program in which I implemented a secondary sort that works perfectly. I wrote the program like this:

import org.apache.spark.Partitioner

object rfmc {
  // Custom Key and partitioner

  case class RFMCKey(cId: String, R: Double, F: Double, M: Double, C: Double)
  class RFMCPartitioner(partitions: Int) extends Partitioner {
    require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[RFMCKey]
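      // hashCode can be negative; shift the remainder into [0, numPartitions).
      val mod = k.cId.hashCode() % numPartitions
      if (mod < 0) mod + numPartitions else mod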
    }
  }
  object RFMCKey {
    implicit def orderingBycId[A <: RFMCKey] : Ordering[A] = {
      Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
    }
  }
  // The body of the code
  //
  //
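  // Assumes each record of rdd is a (cust, r, f, m, c) tuple.
  val x = rdd.map { case (cust, r, f, m, c) => (RFMCKey(cust, r, f, m, c), r + "," + f + "," + m + "," + c) }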
  val y = x.repartitionAndSortWithinPartitions(new RFMCPartitioner(1))
}

I wanted to implement the same thing using Flambo, a Clojure DSL for Spark. Since I can't write a partitioner in Clojure, I reused the code defined above, compiled it, and used it as a dependency in my Clojure code.

I import the partitioner and the key into my Clojure code like this:

(ns xyz
  (:import
    [package RFMCPartitioner]
    [package RFMCKey]))

But when I try to create an RFMCKey with (RFMCKey. cust_id rfmc), it throws the following error:

java.lang.ClassCastException: org.formcept.wisdom.RFMCKey cannot be cast to java.lang.Comparable
    at org.spark-project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
    at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
    at org.apache.spark.util.collection.ExternalSorter$$anon$8.compare(ExternalSorter.scala:170)
    at org.apache.spark.util.collection.ExternalSorter$$anon$8.compare(ExternalSorter.scala:164)
    at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:252)
    at org.apache.spark.util.collection.TimSort.sort(TimSort.java:110)
    at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
    at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83)
    at org.apache.spark.util.collection.ExternalSorter.partitionedIterator(ExternalSorter.scala:687)
    at org.apache.spark.util.collection.ExternalSorter.iterator(ExternalSorter.scala:705)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:64)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

My guess is that it's not able to find the ordering that I defined after the partitioner. But if it works in Scala, why doesn't it work in Clojure?

Answer 1

So I finally figured it out on my own. I basically had to write my custom ordering function as a separate Scala project and then call it from Clojure.

I wrote my Scala file like this:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

case class RFMCKey(cId: String, R: Double, F: Long, M: Double, C: Double)
class RFMCPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[RFMCKey]
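    // hashCode can be negative; shift the remainder into [0, numPartitions).
    val mod = k.cId.hashCode() % numPartitions
    if (mod < 0) mod + numPartitions else mod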
  }
}
object RFMCKey {
  implicit def orderingBycId[A <: RFMCKey] : Ordering[A] = {
    Ordering.by(k => (k.R, k.F * -1, k.M * -1, k.C * -1))
  }
}

class rfmcSort {
  def sortWithRFMC(a: RDD[(String, (((Double, Long), Double), Double))], parts: Int): RDD[(RFMCKey, String)] = {
    // Build (key, "R,F,M,C") pairs, then repartition and sort by key within each partition.
    a.map {
      case (custId, (((rVal, fVal), mVal), cVal)) =>
        (RFMCKey(custId, rVal, fVal, mVal, cVal), s"$rVal,$fVal,$mVal,$cVal")
    }.repartitionAndSortWithinPartitions(new RFMCPartitioner(parts))
  }
}
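
To see what the implicit ordering does without a cluster, here is a minimal stand-alone sketch in plain Scala. The key and ordering are copied from the file above; the sample keys and the OrderingDemo object are made up for illustration and are not part of the original project.

```scala
// Stand-alone sketch (no Spark needed): same key shape and ordering as above.
case class RFMCKey(cId: String, R: Double, F: Long, M: Double, C: Double)

object RFMCKey {
  // Ascending by R, then descending by F, M and C (negation flips the order).
  implicit def orderingBycId[A <: RFMCKey]: Ordering[A] =
    Ordering.by((k: A) => (k.R, k.F * -1, k.M * -1, k.C * -1))
}

object OrderingDemo {
  // Hypothetical sample keys, made up for illustration only.
  val keys: Seq[RFMCKey] = Seq(
    RFMCKey("c1", 2.0, 5L, 10.0, 1.0),
    RFMCKey("c2", 1.0, 3L, 20.0, 2.0),
    RFMCKey("c3", 1.0, 7L, 5.0, 3.0)
  )

  // The Scala compiler finds the implicit Ordering in the companion object.
  val ord: Ordering[RFMCKey] = implicitly[Ordering[RFMCKey]]

  // c2 and c3 tie on R = 1.0, so the larger F (c3) comes first.
  val sortedIds: Seq[String] = keys.sorted(ord).map(_.cId)

  def main(args: Array[String]): Unit =
    println(sortedIds.mkString(",")) // prints "c3,c2,c1"
}
```

The lookup of `ord` happens at compile time in Scala code, which is why the sort has to be invoked from the Scala side for this ordering to be used.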

I compiled it as a Scala project and used it in my Clojure code this way:

(:import [org.formcept.wisdom rfmcSort]
         [org.apache.spark.rdd RDD])

sorted-rfmc-records (.toJavaRDD (.sortWithRFMC (rfmcSort.) (.rdd rfmc-records) num_partitions))

Note how I call the sortWithRFMC method on an instance of the rfmcSort class that I created. One very important thing to note: when you pass your JavaPairRDD to the Scala function, you first have to convert it into a plain Spark RDD by calling the .rdd method on it. Then you have to convert the resulting Spark RDD back into a Java RDD (the snippet above uses .toJavaRDD) to work with it in Clojure.
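
As for why the original attempt failed: the stack trace in the question passes through NaturalOrdering.compare, which suggests the sort fell back to the key's natural ordering instead of the implicit Scala Ordering. Implicits are resolved by the Scala compiler, so a call made from Clojure never sees them. A possible alternative, which is not what the answer above does, is to give the key a natural ordering by mixing in Ordered, which makes it Comparable. Below is a minimal sketch of that idea; the names OrderedRFMCKey and OrderedKeyDemo are hypothetical, and whether this alone is enough for the Flambo call path is an assumption that has not been verified here.

```scala
// Hypothetical variant of the key that carries its own natural ordering.
// scala.math.Ordered[T] extends java.lang.Comparable[T].
case class OrderedRFMCKey(cId: String, R: Double, F: Long, M: Double, C: Double)
    extends Ordered[OrderedRFMCKey] {

  // Same rule as the implicit ordering: R ascending, then F, M, C descending.
  override def compare(that: OrderedRFMCKey): Int = {
    val byR = java.lang.Double.compare(R, that.R)
    if (byR != 0) byR
    else {
      val byF = java.lang.Long.compare(that.F, F) // arguments swapped: descending
      if (byF != 0) byF
      else {
        val byM = java.lang.Double.compare(that.M, M)
        if (byM != 0) byM else java.lang.Double.compare(that.C, C)
      }
    }
  }
}

object OrderedKeyDemo {
  // Made-up sample keys for illustration only.
  val a = OrderedRFMCKey("c1", 1.0, 3L, 20.0, 2.0)
  val b = OrderedRFMCKey("c2", 1.0, 7L, 5.0, 3.0)

  // The cast to Comparable that failed in the stack trace succeeds for this type.
  val asComparable: Comparable[OrderedRFMCKey] =
    b.asInstanceOf[Comparable[OrderedRFMCKey]]

  // Equal R, and b has the larger F, so b sorts before a.
  val bBeforeA: Boolean = asComparable.compareTo(a) < 0

  def main(args: Array[String]): Unit =
    println(bBeforeA)
}
```

The trade-off is that the ordering becomes part of the key type itself, whereas the wrapper approach in the answer keeps the key untouched and confines the Scala-specific parts to one method.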

Solution 1
0, accepted, 2016-07-11 18:43:40
