
Apache Spark GraphX java.lang.ArrayIndexOutOfBoundsException

I am trying to understand how to work with Spark GraphX, but I keep running into problems, so maybe somebody could advise me what to read, etc. I have tried the Spark documentation and the Learning Spark book (O'Reilly Media), but could not find any explanation of how much memory is needed to handle networks of different sizes, etc.

For my tests I use several example datasets. I run them on 1 master node (~16 GB RAM) from the Spark shell:

./bin/spark-shell --master spark://192.168.0.12:7077 --executor-memory 2900m --driver-memory 10g

And 3-5 workers (one worker per separate machine, each with 4 GB RAM):

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.0.12:7077

Then from the Spark shell I run my Scala scripts (not compiled):

:load /home/ubuntu/spark-1.2.1/bin/script.scala

I do not use HDFS yet; I just copied the dataset files to each machine (with the same path names, of course). On small networks like Zachary's karate club, or even bigger ~256 MB networks (after increasing the driver-memory parameter), I am able to count triangles, wedges, etc.
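
For reference, the counting part of my scripts is essentially the standard GraphX calls, for example (a minimal sketch; the dataset path is just a placeholder):

// Minimal triangle-counting sketch (placeholder path).
// triangleCount() expects canonically oriented (srcId < dstId) and partitioned edges.
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val g = GraphLoader
  .edgeListFile(sc, "graphx/data/small-network.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Each triangle is counted once per vertex, so divide the sum by 3.
val triangles = g.triangleCount().vertices.map(_._2).reduce(_ + _) / 3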

Now I am trying to deal with 750+ MB networks and I get errors. For example, I have a Wikipedia links dataset in a two-column format (link_from link_to), 750 MB. I try to load it:

val graph = GraphLoader.edgeListFile(sc, "graphx/data/dbpidia")

And I get an error:

[Stage 0:==============================================>     (22 + 1) / 23]
15/04/30 22:52:46 WARN TaskSetManager: Lost task 22.0 in stage 0.0 (TID 22, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/30 22:52:47 WARN TaskSetManager: Lost task 22.2 in stage 0.0 (TID 24, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException

Actually I need to work with datasets of size >>1 TB, but I get errors even on smaller ones. What am I doing wrong? What are the memory limits? What strategy would you propose for >>1 TB files, and how is it best to store them? Thanks.

It might be a bug in GraphX:

https://issues.apache.org/jira/browse/SPARK-5480

I have the same issue. It works fine on a small data set, but when the data gets bigger, Spark throws an ArrayIndexOutOfBoundsException.
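
If it turns out not to be that bug but simply blank or one-column lines in the input (in older GraphLoader versions, lines with fewer than two fields can trigger exactly this ArrayIndexOutOfBoundsException: 1), one possible workaround is to parse the file yourself and build the graph with Graph.fromEdgeTuples. A rough sketch, assuming whitespace-separated numeric vertex IDs:

// Rough workaround sketch: filter out malformed lines before building the graph.
import org.apache.spark.graphx.Graph

val rawEdges = sc.textFile("graphx/data/dbpidia")
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#"))  // skip blanks and comments
  .map(_.split("\\s+"))
  .filter(_.length >= 2)                                    // drop malformed lines instead of failing
  .map(fields => (fields(0).toLong, fields(1).toLong))

val graph = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)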
