Avoiding collect() on large datasets

Question

We're using Apache Spark for processing. Several of our steps call collect() to turn a JavaRDD into a list so that we can operate on the list, and we want to avoid that. collect() brings everything back to the driver, and we end up running out of memory because we process anywhere from 5 million to 200 million records. Here's an example of what we have so far.

private InputStream createCSVObject(JavaRDD<Object[]> args) {
        System.out.println("inside createCSVObject");
        try {
            StringBuilder value = new StringBuilder(CHUNK_SIZE);

            args.collect().forEach(i -> {
                value.append(i[0].toString());
                for (int j = 1; j < i.length; ++j) {
                    value.append("," + i[j]);
                }
                value.append("\n");
            });
            System.out.println("Out of createCSVObject for loops");
            byte[] strBytes = value.toString().getBytes();

            InputStream myInputStream = new ByteArrayInputStream(strBytes);
            return (myInputStream);
        } catch (Exception e) {
            System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
            return null;
        }
    }

I've searched Stack Overflow and Google for this over and over and can't come up with anything definitive. Does anyone have any ideas?

Note: the call in question is the collect() at args.collect().

EDIT:

After looking into the proposed answer below, we put together a simple proof of concept for it, and what we came up with gets through one iteration roughly every 40 seconds. The logic is not complex, so why is it so slow?

        System.out.println("inside createCSVObject");
        try {
            StringBuilder value = new StringBuilder();
            System.out.println("args length " + args.toLocalIterator().next().length);

             while (args.toLocalIterator().hasNext()) {
                 Object[] objects = args.toLocalIterator().next();
                 System.out.println("Inside iterator");
                 value.append(objects[0].toString());
                 for (int j = 1; j < objects.length; ++j) {
                     value.append("," + objects[j]);
                 }
                 value.append("\n");
             }

            System.out.println("Out of createCSVObject for loops");
            byte[] strBytes = value.toString().getBytes();

            InputStream myInputStream = new ByteArrayInputStream(strBytes);
            return (myInputStream);
        } catch (Exception e) {
            System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
            e.printStackTrace();
            return null;
        }

Answer 1

You can use JavaRDD.toLocalIterator() to iterate through the entire RDD on the driver without collecting it all into a list. Instead, it brings the partitions over to the driver one at a time, so it doesn't use more memory than the size of the largest partition (documentation).
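As a rough sketch of that pattern: the iterator should be obtained once and then reused, because each call to toLocalIterator() hands back a fresh iterator that starts again from the first element. The class name CsvRowWriter and the method writeCsv below are made up for illustration, and a plain List stands in for the RDD so the snippet runs without Spark. With Spark you would pass the result of a single rdd.toLocalIterator() call instead.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Spark.
class CsvRowWriter {

    // Writes each row as one comma-separated line. Only the current row is
    // held by this method; how much the Writer buffers is up to the caller.
    static void writeCsv(Iterator<Object[]> rows, Writer out) throws IOException {
        while (rows.hasNext()) {
            Object[] row = rows.next();
            for (int j = 0; j < row.length; j++) {
                if (j > 0) {
                    out.write(',');
                }
                out.write(String.valueOf(row[j]));
            }
            out.write('\n');
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an RDD. With Spark (assumption) this would be:
        //   Iterator<Object[]> rows = rdd.toLocalIterator();  // called once
        List<Object[]> sample = Arrays.asList(
                new Object[] {"a", 1},
                new Object[] {"b", 2});
        StringWriter out = new StringWriter();
        writeCsv(sample.iterator(), out);
        System.out.print(out);
    }
}
```

Writing to a Writer backed by a file or network stream, rather than a StringWriter, keeps the full output off the driver's heap.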

Obviously, in the example you've given, you still have the problem that you're collecting everything into a massive byte array, which will still use quite a lot of memory. Instead, you could write a custom InputStream class that wraps an Iterator (as returned by toLocalIterator) and buffers only one element at a time, calling next() on the iterator only when InputStream.read() demands more data.
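One possible shape for such a stream is sketched below. The class name IteratorCsvInputStream is hypothetical, the CSV formatting simply mirrors the question's code (no quoting or escaping), and UTF-8 is an assumed encoding. It takes any Iterator<Object[]>, so with Spark you would construct it from a single rdd.toLocalIterator() call.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// Hypothetical class: serves rows as CSV bytes, buffering one row at a time.
class IteratorCsvInputStream extends InputStream {
    private final Iterator<Object[]> rows;
    private byte[] current = new byte[0]; // bytes of the row being served
    private int pos = 0;                  // next unread index in 'current'

    IteratorCsvInputStream(Iterator<Object[]> rows) {
        this.rows = rows;
    }

    // Pulls the next row only when the current one is used up.
    // Returns false once the iterator has no more rows.
    private boolean fill() {
        while (pos >= current.length) {
            if (!rows.hasNext()) {
                return false;
            }
            Object[] row = rows.next();
            StringBuilder line = new StringBuilder();
            for (int j = 0; j < row.length; j++) {
                if (j > 0) {
                    line.append(',');
                }
                line.append(row[j]);
            }
            line.append('\n');
            current = line.toString().getBytes(StandardCharsets.UTF_8);
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        // Copy at most the remainder of the current row; callers loop as usual.
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, b, off, n);
        pos += n;
        return n;
    }
}
```

Overriding read(byte[], int, int) is optional, but without it every bulk read falls back to the single-byte read(), which is slow for large outputs.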

Answer 1
0 2019-11-25 17:26:14
