
Spark: break partition iterator for better memory management?

I'm trying to develop heavy mathematical calculations in Spark, heavy both in terms of time and memory (up to O(n^2) for each). I've found that a partition holding an Iterator is not really adequate for big calculus, since it forces you to instantiate (though lazily, since it's an Iterator) one object per line. Indeed, in the simplest scenario one would hold, for instance, one vector per line. But that is harmful both for memory, given the JVM overhead per object and all the pressure it puts on the GC, and for speed, since I could really improve performance by moving my linear algebra operations up to BLAS level-3 (matrix-by-matrix instead of the matrix-by-vector I'm stuck with in this paradigm). Very schematically, here's what I want to achieve:

while (???) { // loop over some condition, doesn't really matter what
    val matrix = ??? // an instance of a matrix
    val broadMatrix = sparkContext.broadcast(matrix)
    // rdd is an instance of RDD[Vector] that is already cached
    rdd.mapPartitions {
        iter =>
            val matrixValue = broadMatrix.value
            iter.map(vector => matrixValue * vector)
    }
    // a bunch of other things relying on that result
}

Here are my thoughts:

  1. as my rdd in the code above is cached, having an Iterator is useless, isn't it? Its only advantage is not holding all the lines in memory at the same time; but here the rdd has been computed and cached, so all the lines are held in memory anyway... Of course one could argue that Spark might have an intelligent cache that serializes and compresses the data (which I doubt when the storage level is MEMORY_ONLY, though... see the persist sketch after this list).

  2. if 1. is true, then all the Iterator produces is a huge memory overhead, as I have as many JVM objects as there are rows in my rdd, whereas I could lower that to a single JVM object per partition. I could even lower it to a single object per executor, with a Scala object acting as shared memory for all the partitions living on the same executor (this I fear might be hard to handle, though, as I want to keep Spark's resilience: if a partition is removed for any reason and re-appears on another executor, I don't want to handle that myself but let Spark move all the related objects on its own...).
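
For reference, a minimal sketch (in the spirit of point 1, with cacheSerialized as a made-up helper name) of asking Spark for a serialized, and optionally compressed, cache when the RDD is first persisted, instead of the default deserialized MEMORY_ONLY; whether this actually helps depends on how expensive serialization is for the row type:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY keeps one deserialized JVM object per row; MEMORY_ONLY_SER keeps each cached
// partition as a single serialized byte array, which spark.rdd.compress=true (off by default)
// can additionally compress at the cost of extra CPU.
def cacheSerialized[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_ONLY_SER)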

My idea hence would be to transform this rdd of vectors into one containing matrices, something like:

while (???) { // loop over some condition, doesn't really matter what
    val matrix = ??? // an instance of a matrix
    val broadMatrix = sparkContext.broadcast(matrix)
    // rdd is an instance of RDD[Vector] that is already cached
    rdd.mapPartitions {
        iter =>
            val matrixValue = broadMatrix.value
            // iter actually contains one single element, which is the matrix containing all vectors stacked
            // here we have a BLAS-3 operation
            iter.map(stacked => matrixValue * stacked)
    }
    // a bunch of other things relying on that result
}
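
To make that concrete, here is a minimal sketch of how the transformation could look, assuming the rows are Breeze DenseVectors and that a partition still fits in memory once stacked (stackPartition and toMatrixRdd are just illustrative names):

import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Stack every vector of a partition into one matrix, one row per original vector.
// Note this materializes the whole partition, trading the laziness of the Iterator
// for a single BLAS-3 multiply later on.
def stackPartition(iter: Iterator[DenseVector[Double]]): Iterator[DenseMatrix[Double]] = {
  val rows = iter.toArray
  if (rows.isEmpty) Iterator.empty
  else {
    val stacked = DenseMatrix.zeros[Double](rows.length, rows.head.length)
    rows.zipWithIndex.foreach { case (v, i) => stacked(i, ::) := v.t }
    Iterator.single(stacked)
  }
}

// Done once, outside the while loop, on the vector rdd from the snippets above.
def toMatrixRdd(rowVectors: RDD[DenseVector[Double]]): RDD[DenseMatrix[Double]] =
  rowVectors.mapPartitions(stackPartition, preservesPartitioning = true).cache()

Inside the loop, each iteration would then be a single gemm per partition, e.g. matrixRdd.map(block => block * broadMatrix.value.t).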

Has anyone already faced this dilemma? Do you have experience with advanced memory-management usage like this?

since I could really improve performance by moving my linear algebra operations up to BLAS level-3 (matrix-by-matrix instead of the matrix-by-vector I'm stuck with in this paradigm).

Using Iterators doesn't force you in any way to use Vectors, or even to keep more than one element per partition. You can easily create a single Matrix object for each split if you want.

harmful both for memory, given the JVM overhead per object and all the pressure it puts on the GC

I'd argue that it is more complicated than that. The reason for using Iterators is to be able to handle partitions which are larger than memory. With lazy Iterators and small objects, Spark can spill partial results to disk and make them eligible for garbage collection. This cannot happen when you use a single large object. In my experience Spark is much more susceptible to GC problems with large objects.
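
A possible middle ground, under the same Breeze assumptions as the sketch above (multiplyInBlocks and blockSize are made-up names), is to process each partition in fixed-size blocks with Iterator.grouped: every multiply is still a BLAS-3 gemm, but no single object grows with the partition size, so the iterator stays lazy enough for spilling and garbage collection to do their work:

import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// blockSize rows per gemm: large enough to benefit from BLAS-3, small enough to keep
// per-object memory bounded regardless of partition size.
def multiplyInBlocks(rows: RDD[DenseVector[Double]],
                     broadMatrix: Broadcast[DenseMatrix[Double]],
                     blockSize: Int = 1024): RDD[DenseVector[Double]] =
  rows.mapPartitions { iter =>
    val m = broadMatrix.value
    iter.grouped(blockSize).flatMap { chunk =>
      // stack only blockSize vectors at a time into a small matrix
      val block = DenseMatrix.zeros[Double](chunk.length, chunk.head.length)
      chunk.zipWithIndex.foreach { case (v, i) => block(i, ::) := v.t }
      val product = block * m.t   // one gemm per block instead of one gemv per row
      (0 until product.rows).iterator.map(r => product(r, ::).t.copy)
    }
  }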

Based on the description, I suspect it would make sense to avoid storing the data explicitly and instead initialize the objects explicitly using off-heap memory. This should keep the GC at bay and allow you to handle large objects. But that is way above my pay grade.
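
For completeness, a hedged sketch of what off-heap could mean here: copy one partition's rows into a direct ByteBuffer, so the bulk of the data lives outside the GC-managed heap and only a thin handle stays visible to the collector. It assumes the partition's row and column counts are known up front and ignores Int overflow for truly huge partitions (toOffHeap is a made-up name):

import java.nio.{ByteBuffer, ByteOrder}
import breeze.linalg.DenseVector

// Streams one partition's rows into a single direct (off-heap) buffer; the GC only ever
// sees the small ByteBuffer handle, not numRows * numCols heap-allocated doubles.
def toOffHeap(iter: Iterator[DenseVector[Double]], numRows: Int, numCols: Int): ByteBuffer = {
  val buf = ByteBuffer
    .allocateDirect(numRows * numCols * java.lang.Double.BYTES)
    .order(ByteOrder.nativeOrder())
  iter.foreach(v => v.toArray.foreach(d => buf.putDouble(d)))
  buf.flip()
  buf
}

// Reading back is explicit, e.g. buf.getDouble((row * numCols + col) * java.lang.Double.BYTES).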
