Apache Spark GraphX java.lang.ArrayIndexOutOfBoundsException

I am trying to understand how to work with Spark GraphX, but I keep running into problems, so maybe somebody could advise me what to read. I have looked through the Spark documentation and the Learning Spark book (O'Reilly Media), but could not find any explanation of how much memory is needed for networks of different sizes.

For my tests I use several example datasets. I run them from the Spark shell on one master node (~16 GB RAM):

./bin/spark-shell --master spark://192.168.0.12:7077 --executor-memory 2900m --driver-memory 10g

And 3-5 workers (one worker per separate machine, each with 4 GB RAM):

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.0.12:7077

Then from the Spark shell I run my Scala scripts (not compiled):

:load /home/ubuntu/spark-1.2.1/bin/script.scala

I do not use HDFS yet; I just copied the dataset files to each machine (with the same path names, of course). On small networks like Zachary's karate club, or even larger ~256 MB networks (after increasing the driver-memory parameter), I am able to count triangles, wedges, etc.
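For reference, the small-network runs look roughly like this (a minimal sketch; the file path is a placeholder, and triangleCount expects canonically oriented, partitioned edges):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load the edge list with edges oriented srcId < dstId, as triangleCount expects,
// then repartition the edges before counting. The path is just an example.
val g = GraphLoader
  .edgeListFile(sc, "graphx/data/karate.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// Each triangle is counted at all three of its vertices, hence the division by 3.
val totalTriangles = g.triangleCount().vertices.map(_._2).reduce(_ + _) / 3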

Now I am trying to deal with 750+ MB networks and I get errors. For example, I have a Wikipedia links dataset in a two-column format (link_from link_to), 750 MB. I try to load it:

val graph = GraphLoader.edgeListFile(sc, "graphx/data/dbpidia")

And get an error:

[Stage 0:==============================================>     (22 + 1) / 23]
15/04/30 22:52:46 WARN TaskSetManager: Lost task 22.0 in stage 0.0 (TID 22, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/30 22:52:47 WARN TaskSetManager: Lost task 22.2 in stage 0.0 (TID 24, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException

Actually I need to work with datasets of size >>1 TB, but I get errors even on smaller ones. What am I doing wrong? What are the memory limits? What strategy would you propose for >>1 TB files, and how should I store them? Thanks.

It might be a bug in GraphX:

https://issues.apache.org/jira/browse/SPARK-5480

I have the same issue as yours. It works fine on a small dataset. When the data size gets bigger, Spark throws an ArrayIndexOutOfBoundsException.
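Whether it is that bug or the input itself, the trace fails inside GraphLoader at the point where it reads the second field of a row (index 1), which is what happens when a line splits into fewer than two whitespace-separated tokens (blank or malformed lines). A possible workaround is to parse and filter the edges yourself and build the graph with Graph.fromEdgeTuples; a rough sketch, where the path and the whitespace separator are assumptions about your dbpidia file:

import org.apache.spark.graphx.Graph

// Parse the edge list by hand so that blank, comment or malformed rows are
// dropped instead of triggering an out-of-bounds access on the split result.
val rawEdges = sc.textFile("graphx/data/dbpidia")
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .map(_.split("\\s+"))
  .filter(_.length >= 2)
  .map(fields => (fields(0).toLong, fields(1).toLong))

// Build the graph directly from the cleaned (srcId, dstId) pairs.
val graph = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)

Note that this still assumes both columns are numeric vertex IDs (GraphLoader has the same requirement); if the dump uses page titles instead, they would have to be mapped to Long IDs first.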
