Spark java.lang.StackOverflowError

I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries, though, it works fine.

Entry example:

product/productId: B00004CK40   review/userId: A39IIHQF18YGZA   review/profileName: C. A. M. Salas  review/helpfulness: 0/0 review/score: 4.0   review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.

The code:

public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();

    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    // Parse each raw review line into "movieId \t userId"
    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {

        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    // Group the user ids that reviewed the same movie
    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {

        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    // Collect the per-movie user groups to the driver and build the
    // cartesian product of each group with itself
    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
        iterCounter++; // advance the counter used for periodic checkpointing
    }
    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));

    // Drop self-pairs and format each remaining pair as "userA userB"
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));

    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {

        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this:
        // https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        e.printStackTrace();
    }

    sc.close();
}

When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so; checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
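
A minimal sketch of what that looks like against the loop in the question. The 10-iteration interval and the count() call that forces the checkpoint to actually be written are my additions, not part of the original code:

// Periodically checkpoint the accumulating RDD so its lineage stays bounded.
sc.setCheckpointDir("pagerankCheckpoint/");

JavaPairRDD<String, String> finalCartesian = null;
int iterCounter = 0;
for (Iterable<String> out : cartUsersList) {
    JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
    JavaPairRDD<String, String> pairs = currentUsersRDD.cartesian(currentUsersRDD);
    finalCartesian = (finalCartesian == null) ? pairs : pairs.union(finalCartesian);

    iterCounter++;
    if (iterCounter % 10 == 0) {
        finalCartesian.checkpoint();
        finalCartesian.count(); // checkpoint() is lazy; an action makes it actually write
    }
}

Once the checkpoint has been materialized, the RDD's lineage is truncated, so later unions build on the checkpointed data instead of an ever-growing chain of transformations.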

I have multiple suggestions which will help you to greatly improve the performance of the code in your question.

  1. Caching: Caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).

An example is RDD.count: to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, with no recomputation.

Read more about caching here.
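
A tiny sketch of that behaviour, using the sc from the question ("reviews.txt" is just a placeholder file name):

JavaRDD<String> lines = sc.textFile("reviews.txt").cache();

long first = lines.count();  // reads the file, caches the partitions, counts the lines
long second = lines.count(); // answered from the cache; the file is not read again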

In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.

  2. Parallelization: In the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest you merge the rddFileData, rddMovieData and rddPairReviewData steps so that it all happens in one go, as sketched below.

Get rid of .collect, since that brings the results back to the driver and may well be the actual reason for your error.
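
A hedged sketch of what merging those steps and dropping the driver-side loop could look like. The variable names are mine, java.util.List/ArrayList imports are assumed, and the Iterator return in flatMap assumes the Spark 2.x Java API (on 1.x the function returns the Iterable itself):

// Parse each raw review line straight into a (movieId, userId) pair.
JavaPairRDD<String, String> movieUser = sc.textFile(inputFileName)
    .mapToPair(line -> {
        String[] data = line.split("\t");
        String movieId = data[0].split(":")[1].trim();
        String userId = data[1].split(":")[1].trim();
        return new Tuple2<>(movieId, userId);
    });

// For every movie, emit "userA userB" for all distinct reviewer pairs,
// entirely on the executors, with no collect() to the driver.
JavaRDD<String> userIdPairsString = movieUser
    .groupByKey()
    .flatMap(entry -> {
        List<String> users = Lists.newArrayList(entry._2());
        List<String> pairs = new ArrayList<>();
        for (String a : users) {
            for (String b : users) {
                if (!a.equals(b)) {
                    pairs.add(a + " " + b);
                }
            }
        }
        return pairs.iterator();
    });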

This problem will occur when your DAG grows big and too many levels of transformations happen in your code: the JVM can no longer hold the chain of operations needed for lazy execution when an action is finally performed.

Checkpointing is one option. I would suggest implementing this kind of aggregation with spark-sql. If your data is structured, try loading it into dataframes and performing the grouping and other SQL functions there.
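
A rough sketch of that direction, reusing the parsed rddMovieData from the question. It assumes a Spark 2.x SparkSession named spark (on 1.x it would be SQLContext), and the column names are mine:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Turn the "movieId \t userId" lines into a DataFrame.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("movieId", DataTypes.StringType, false),
    DataTypes.createStructField("userId", DataTypes.StringType, false)
});
JavaRDD<Row> rows = rddMovieData.map(line -> {
    String[] parts = line.split("\t");
    return RowFactory.create(parts[0], parts[1]);
});
Dataset<Row> reviews = spark.createDataFrame(rows, schema);

// Self-join on movieId to get reviewer pairs, dropping self-pairs,
// and let the SQL planner handle the grouping and shuffling.
Dataset<Row> userPairs = reviews.as("a")
    .join(reviews.as("b"), "movieId")
    .where("a.userId <> b.userId")
    .selectExpr("a.userId as src", "b.userId as dst");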

Unfortunately, the solution that worked easily for me was to call .collect() after every few iterations. Well, things work at least, as a quick fix.

In a hurry, I couldn't get the suggested solution of using checkpoint to work (and maybe it wouldn't have worked anyway?).


Note: it also seems that setting a Spark option might do the trick... but I don't have the time right now, so I didn't check how to set Spark's Java options from pyspark, if that's even possible. Related pages for changing the config:
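
For what it's worth, one guess at the option being hinted at is the JVM thread stack size, passed through Spark's extraJavaOptions settings; the -Xss16m value below is arbitrary and only a sketch:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Raise the JVM thread stack size. The executor-side option can go in SparkConf;
// the driver-side one has to be set before the driver JVM starts, e.g. via
// spark-submit --driver-java-options "-Xss16m".
SparkConf conf = new SparkConf()
    .setAppName("pagerank")
    .set("spark.executor.extraJavaOptions", "-Xss16m"); // 16m is an arbitrary guess
JavaSparkContext sc = new JavaSparkContext(conf);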

If someone gets this to work by changing the max recursion limit, a comment here would be helpful for others.
