
Spark RDD, how to generate JavaRDD of length N?

(Part of the problem is that the docs say "undocumented" on parallelize, which leaves me reading books for examples that don't always pertain.)

I am trying to create an RDD of length N = 10^6 by executing N operations of a Java class we have; I can have that class implement Serializable or any Function if necessary. I don't have a fixed-length dataset up front, I am trying to create one. I'm trying to figure out whether to create a dummy array of length N to parallelize, or to pass parallelize a function that runs N times.

I'm not sure which approach is valid/better. I see that in Spark, if I start out with a well-defined data set like the words in a doc, the length/count of those words is already defined and I just parallelize some map or filter to do some operation on that data.

In my case I think it's different: I'm trying to parallelize the creation of an RDD that will contain 10^6 elements...

DESCRIPTION:

In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a PipeLinkageData and returns a DropResult.
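For context, here are hypothetical skeletons of the two classes as inferred from the question (the real classes do more; both implementing Serializable matters so Spark can ship them to workers):

    // Hypothetical skeletons inferred from the question; the real
    // classes do more. Serializable lets Spark ship them to workers.
    public class DropResult implements Serializable {
        // fields describing the outcome of one drop
    }

    public class PipeLinkageData implements Serializable {
        public DropResult doDrop() {
            // run one simulation drop and return its result
            return new DropResult();
        }
    }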

I am thinking I could use map() or flatMap() to call a one-to-many function. I was trying to do something like this in another question that never quite worked:

// spark is a JavaSparkContext; makeRange(1, getSimCount()) builds a
// List<Integer> with one entry per drop to run
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount()))
    .map(new Function<Integer, DropResult>() {
        public DropResult call(Integer i) {
            return pld.doDrop();
        }
    });
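The makeRange helper is never shown in the post; a minimal version (hypothetical, assuming java.util.stream imports) could be:

    // Hypothetical makeRange: builds the List<Integer> handed to
    // parallelize, using Java 8 streams (IntStream, Collectors)
    private static List<Integer> makeRange(int startInclusive, int endInclusive) {
        return IntStream.rangeClosed(startInclusive, endInclusive)
                        .boxed()
                        .collect(Collectors.toList());
    }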

Or is something like this more the correct approach?

    // pld is of type PipeLinkageData; it's already initialized

    // parallelize wants a collection passed into the first param,
    // so make an ArrayList holding just the one pld
    List<PipeLinkageData> pldListofOne = new ArrayList<PipeLinkageData>();
    pldListofOne.add(pld);

    final int howMany = 1000000;

    JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne)
        .flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
            public Iterable<DropResult> call(PipeLinkageData pld) {
                List<DropResult> returnList = new ArrayList<DropResult>(howMany);

                // is Spark good at spreading a for loop like this?
                for (int i = 0; i < howMany; i++) {
                    returnList.add(pld.doDrop());
                }

                return returnList;
            }
        });

One other concern: is JavaRDD correct here? I can see needing to call FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never trying to flatten a group of arrays or lists into a single array or list, do I really need to flatten anything?

  1. The first approach should work as long as DropResult and PipeLinkageData can be serialized and there are no issues with their internal logic (like depending on shared state).

  2. The second approach in its current form doesn't make sense. A single record will be processed on a single partition, which means the whole process is completely sequential and can crash if the data doesn't fit in a single worker's memory. Increasing the number of elements should solve that problem, but it doesn't improve on the first approach.

  3. Finally, you can initialize an empty RDD and then use mapPartitions, replacing the FlatMapFunction with the almost identical MapPartitionsFunction, and generate the required number of objects per partition, as sketched below.
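A minimal sketch of that third approach, assuming Spark 1.5.1's Java API (where mapPartitions on a JavaRDD takes a FlatMapFunction over the partition's java.util.Iterator) and assuming pld is in scope and serializable; numPartitions and perPartition are illustrative names:

    // Sketch only: an empty RDD still carries its partition count, and
    // mapPartitions runs once per partition, so each call can generate
    // its share of the 10^6 DropResults. Assumes pld is Serializable.
    int numPartitions = 100;
    final int perPartition = 1000000 / numPartitions;

    JavaRDD<Integer> seed = spark.parallelize(new ArrayList<Integer>(), numPartitions);

    JavaRDD<DropResult> nSizedRDD = seed.mapPartitions(
        new FlatMapFunction<Iterator<Integer>, DropResult>() {
            public Iterable<DropResult> call(Iterator<Integer> ignored) {
                List<DropResult> out = new ArrayList<DropResult>(perPartition);
                for (int i = 0; i < perPartition; i++) {
                    out.add(pld.doDrop());
                }
                return out;
            }
        });

Each of the 100 partitions then builds its 10,000 DropResults independently, so the work is spread across the cluster instead of being funneled through a single partition as in the second approach.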
