
Clojure: Scala/Java interop issues for Spark GraphX

I am trying to use Spark/GraphX from Clojure, via Flambo.

Here is the code I ended up with:

In the project.clj file:

(defproject spark-tests "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [yieldbot/flambo "0.5.0"]]
  :main ^:skip-aot spark-tests.core
  :target-path "target/%s"
  :checksum :warn
  :profiles {:dev {:aot [flambo.function]}
             :uberjar {:aot :all}
             :provided {:dependencies
                        [[org.apache.spark/spark-core_2.10 "1.2.0"]
                         [org.apache.spark/spark-graphx_2.10 "1.2.0"]]}})

And then my Clojure core.clj file:

(ns spark-tests.core  
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [flambo.tuple :as ft])
  (:import (org.apache.spark.graphx Edge)
           (org.apache.spark.graphx.impl GraphImpl)))

(defonce c (-> (conf/spark-conf)
               (conf/master "local")
               (conf/app-name "flame_princess")))

(defonce sc (f/spark-context c))

(def users (f/parallelize sc [(ft/tuple 3 ["rxin" "student"])
                              (ft/tuple 7 ["jgonzal" "postdoc"])
                              (ft/tuple 5 ["franklin" "prof"])]))

(defn edge
  [source dest attr]
  (new Edge (long source) (long dest) attr))

(def relationships (f/parallelize sc [(edge 3 7 "collab")
                                      (edge 5 3 "advisor")]))

(def g (new GraphImpl users relationships))

When I run that code, I get the following error:

1. Caused by java.lang.ClassCastException
   Cannot cast org.apache.spark.api.java.JavaRDD to
   scala.reflect.ClassTag

  Class.java: 3258  java.lang.Class/cast
  Reflector.java:  427  clojure.lang.Reflector/boxArg
  Reflector.java:  460  clojure.lang.Reflector/boxArgs

Disclaimer: I have no Scala knowledge.

Then I thought it might be because Flambo returns a JavaRDD when we use f/parallelize. So I tried to convert the JavaRDD into a plain RDD, as used in the GraphX example:

(def g (new GraphImpl (.rdd users) (.rdd relationships)))

But then I get the same error, this time for the ParallelCollectionRDD class...

From there, I have an idea of what may be causing this. The Java API for the Graph class is here, and the Scala API for the same class is here.

What I am not clear about is how to effectively use that class signature in Clojure:

org.apache.spark.graphx.Graph<VD,ED>

(Graph is an abstract class, but I tried using GraphImpl in this example.)

What I am trying to do is re-create that Scala example using Clojure.

Any hints would be highly appreciated!

Finally got it right (I think). Here is the code that appears to be working:

(ns spark-tests.core
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [flambo.tuple :as ft])
  (:import (org.apache.spark.graphx Edge
                                    Graph)
           (org.apache.spark.api.java JavaRDD
                                      StorageLevels)
           (scala.reflect ClassTag$)))

(defonce c (-> (conf/spark-conf)
               (conf/master "local")
               (conf/app-name "flame_princess")))

(defonce sc (f/spark-context c))

(def users (f/parallelize sc [(ft/tuple 3 ["rxin" "student"])
                              (ft/tuple 7 ["jgonzal" "postdoc"])
                              (ft/tuple 5 ["franklin" "prof"])]))

(defn edge
  [source dest attr]
  (new Edge (long source) (long dest) attr))

(def relationships (f/parallelize sc [(edge 3 7 "collab")
                                      (edge 5 3 "advisor")
                                      (edge 7 3 "advisor")]))


(def g (Graph/apply (.rdd users)
                    (.rdd relationships)
                    "collab"
                    (StorageLevels/MEMORY_ONLY)
                    (StorageLevels/MEMORY_ONLY)
                    (.apply ClassTag$/MODULE$ clojure.lang.PersistentVector)
                    (.apply ClassTag$/MODULE$ java.lang.String)))

(println (.count (.edges g)))

What this code returns is 3, which seems to be correct. The main issue was that I was not creating the graph using Graph/apply. In fact, it appears that this is the way to create these objects (it looks like it acts as the constructor...). I have no idea why it works that way, but this is probably due to my lack of Scala knowledge. If anybody knows, just tell me why :)
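My guess, which I cannot confirm, is that in Scala, Graph(vertices, edges, ...) is sugar for calling apply on the Graph companion object, and that the companion object is compiled to the class Graph$ with its singleton in the static MODULE$ field (the same pattern as ClassTag$ above), plus a static apply forwarder on Graph itself. If that is right, the call above could also be written against the companion object directly:

;; Assumed-equivalent sketch: call apply on the Graph companion object
;; (Graph$/MODULE$) instead of the static forwarder Graph/apply.
(def g2 (.apply org.apache.spark.graphx.Graph$/MODULE$
                (.rdd users)
                (.rdd relationships)
                "collab"
                (StorageLevels/MEMORY_ONLY)
                (StorageLevels/MEMORY_ONLY)
                (.apply ClassTag$/MODULE$ clojure.lang.PersistentVector)
                (.apply ClassTag$/MODULE$ java.lang.String)))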

After that, I only had to fill in the gaps in the signature of the apply function.

One thing to note is the last two parameters:

  • scala.reflect.ClassTag<VD> evidence$17
  • scala.reflect.ClassTag<ED> evidence$18

These are used to tell Scala the vertex attribute type (VD) and the edge attribute type (ED). The type of ED is the type of the object I used as the third parameter of the Edge class, and the type of VD is the type of the second parameter of the tuple function.
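As a small sketch (reusing the imports from the namespace above), these two ClassTags can be built explicitly from plain Java Class objects and then passed as the last two arguments to Graph/apply:

;; VD: the vertex attribute type, i.e. the type of the tuple's second element
(def vd-tag (.apply ClassTag$/MODULE$ clojure.lang.PersistentVector))
;; ED: the edge attribute type, i.e. the type of the Edge's third argument
(def ed-tag (.apply ClassTag$/MODULE$ java.lang.String))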
