Out of memory when adding elements to an immutable Set in Scala

Question

I'm running out of memory in a loop when adding elements to an immutable set. There are already a lot of objects in the set, and I guess it's consuming a lot of memory. As I understand it, when adding an element to an immutable collection, Scala first copies the existing collection into a new set, adds the element to the new set, and returns this new set.

So suppose my JVM memory is 500 MB and the set is consuming 400 MB. Before adding the new element, Scala tries to copy the old set into a new set (which I think would consume another 400 MB). At this step it has already exceeded the JVM memory (total consumed memory 800 MB), and hence it throws an out-of-memory error. The code looks a bit like this:

private def getNewCollection(myMuttableSet: Set[MyType]): Set[MyType] = {
  myMuttableSet.flatMap(c => {
    // this method returns a large collection, so during the loop the collection grows exponentially
    val returnedSet = doSomeCalculationsAndreturnASet
    if (returnedSet.isEmpty) Set.empty[MyType]
    else doSomeCalculationsAndreturnASet + MyType(constArg1, constArg2) // MyType is a case class
  })
}

Kindly advise whether my understanding is correct.

Answer 1

It is not quite as simple as that, because it depends on the size of the elements in the Set.

Creating a new Set is a shallow operation and does not copy the elements in the set; it just creates a new wrapper (typically a hash table of some sort) pointing to the same objects.
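The shallow behaviour described above can be checked with reference equality. The following is a minimal sketch, not code from the question: Payload is a hypothetical element type standing in for a "large" object, and the 1 MB size is arbitrary.

object ShallowCopyDemo {
  // Hypothetical element type standing in for a "large" object.
  final class Payload(val data: Array[Byte])

  val big   = new Payload(new Array[Byte](1024 * 1024)) // roughly 1 MB
  val other = new Payload(new Array[Byte](1024 * 1024))

  val original: Set[Payload] = Set(big)
  val updated: Set[Payload]  = original + other

  // `eq` compares references: both sets point at the very same Payload
  // instance, so the 1 MB array was not duplicated by the `+` call.
  val sharedInstance: Boolean =
    original.exists(_ eq big) && updated.exists(_ eq big)

  def main(args: Array[String]): Unit = {
    println(s"original size = ${original.size}, updated size = ${updated.size}")
    println(s"same instance shared: $sharedInstance")
  }
}

Only the set's own bookkeeping is new in updated; the element objects themselves are shared.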

If you have a small set of large objects, then duplicating that set might not take much storage, because the objects will be shared between the two sets. Most of the memory is used by the objects in the set, and these do not need to be copied to create a new set. So your 400 MB might become 450 MB and fit within the memory limit.

If you have a large set of small objects, then duplicating that set may double the storage. Most of the memory is used by the Set itself and can't be shared between the original set and the copy. In this case your 400 MB could easily become close to 800 MB.

Since you are running out of memory and you say there are a lot of objects, it sounds like this is the problem, but we would need to see the code to tell for sure.

Answer 2

"Before adding the new element, Scala tries to copy the old set into a new set (which I think would consume another 400 MB)."

This is not correct.

Immutable collections in Scala (including Sets) are implemented as persistent data structures, which usually have a property called "structural sharing". That means that when the structure is updated, it is not fully copied. Instead most of it is reused, with only a relatively small part actually being re-created from scratch.

The easiest example to illustrate this is List, which is implemented as a singly linked list, with the root pointing to the head.

For example, you have the following code:

val a = List(3,2,1)
val b = 4 :: a
val c = 5 :: b

Although the three lists combined have 3 + 4 + 5 = 12 elements in total, they physically share the nodes, and there are only 5 List nodes.

5 →  4  →  3 →  2  → 1
↑    ↑     ↑
c    b     a
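The sharing in the diagram can be observed directly. As a small sketch (the object and field names are only for illustration), `eq` tests whether two references point at the same object, and the tail of each longer list turns out to be the shorter list itself:

object ListSharingDemo {
  val a = List(3, 2, 1)
  val b = 4 :: a
  val c = 5 :: b

  // Prepending allocates exactly one new node whose tail is the existing list.
  val bSharesA: Boolean = b.tail eq a
  val cSharesB: Boolean = c.tail eq b

  def main(args: Array[String]): Unit = {
    println(s"b.tail eq a: $bSharesA")
    println(s"c.tail eq b: $cSharesB")
  }
}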

A similar principle applies to Set. Set in Scala is implemented as a HashTrie. I won't go into the specifics of a Trie; just think of it as a tree with a high branching factor. When that tree is updated, it is not copied completely. Only the nodes on the path from the tree root to the new or updated node are copied.

For the HashTrie, the depth of the tree cannot be more than 7 levels. So, when updating a Set in Scala, you're looking at a memory allocation bounded by roughly 7 * 32 array slots in the worst case (at most 7 levels, each node being, roughly speaking, an array of 32), regardless of the Set size.
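The numbers 7 and 32 can be worked out as follows. This is only back-of-the-envelope arithmetic, assuming 32-bit hash codes and 5 bits of the hash consumed per trie level (which is what a branching factor of 32 implies); it does not inspect the actual library internals.

object TrieBoundDemo {
  val hashBits     = 32                // width of an Int hash code on the JVM
  val bitsPerLevel = 5                 // assumption: 2^5 = 32 children per node
  val branching: Int = 1 << bitsPerLevel

  // ceil(32 / 5): the hash bits are exhausted after at most this many levels.
  val maxDepth: Int = (hashBits + bitsPerLevel - 1) / bitsPerLevel

  // Worst case for one update: one node of up to `branching` slots per level.
  val worstCaseSlots: Int = maxDepth * branching

  def main(args: Array[String]): Unit = {
    println(s"max depth = $maxDepth")                        // 7
    println(s"worst-case slots copied = $worstCaseSlots")    // 224
  }
}

A couple of hundred references per update is a constant, so it does not grow with the number of elements already in the set.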

Looking at your code, you have the following things in memory:

myMuttableSet is present until getNewCollection returns.
myMuttableSet.flatMap creates a mutable buffer underneath. Also, after flatMap is done, buffer.result will copy the contents of the mutable buffer over to an immutable set. So there is actually a brief moment when two sets exist.
On every step of flatMap, returnedSet also retains memory.

Side note: why are you calling doSomeCalculationsAndreturnASet again if you already have its result cached in returnedSet? Could that be the root of the problem?
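A possible rewrite that reuses the cached result might look like the sketch below. Everything except the shape of getNewCollection is an assumption made so the example compiles: the fields of MyType, the values of constArg1 and constArg2, the parameter passed to doSomeCalculationsAndreturnASet, and its dummy body are all hypothetical stand-ins. The calls counter exists only to show that the calculation runs once per element.

object ReuseResultDemo {
  final case class MyType(a: Int, b: Int)

  // Hypothetical constants standing in for constArg1 / constArg2.
  val constArg1 = 0
  val constArg2 = 0

  // Counts invocations of the (hypothetical) expensive calculation.
  var calls = 0

  // Hypothetical stand-in: even inputs yield two elements, odd inputs none.
  def doSomeCalculationsAndreturnASet(c: MyType): Set[MyType] = {
    calls += 1
    if (c.a % 2 == 0) Set(MyType(c.a, 1), MyType(c.a, 2)) else Set.empty[MyType]
  }

  def getNewCollection(mySet: Set[MyType]): Set[MyType] =
    mySet.flatMap { c =>
      val returnedSet = doSomeCalculationsAndreturnASet(c)
      if (returnedSet.isEmpty) returnedSet            // already an empty Set[MyType]
      else returnedSet + MyType(constArg1, constArg2) // reuse the cached result
    }

  def main(args: Array[String]): Unit = {
    val result = getNewCollection(Set(MyType(1, 0), MyType(2, 0), MyType(3, 0)))
    println(s"result = $result, calculation calls = $calls")
  }
}

Compared with the code in the question, this holds only one large intermediate set per element instead of two.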

So, at any given point in time you have one of the following in memory (whichever is larger):

myMuttableSet + mutable result set buffer + returnedSet + (another?) result of doSomeCalculationsAndreturnASet
myMuttableSet + mutable result set buffer + immutable result set

To conclude, whatever your problems with memory are, adding the element to the Set is most probably not the culprit. My suggestion would be to suspend your program in a debugger and use any profiler (such as VisualVM) to take heap dumps at different stages.
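As a lightweight complement to a profiler (not a replacement for a heap dump), heap usage can also be logged from inside the program at the stages of interest. This is a rough sketch using the standard java.lang.Runtime API; the helper names and the stage labels are made up for illustration, and the numbers are approximate because they include garbage that has not been collected yet.

object HeapLogDemo {
  private val runtime = Runtime.getRuntime
  private val Mb = 1024L * 1024L

  // Heap currently in use by the JVM, in megabytes.
  def usedHeapMb(): Long = (runtime.totalMemory() - runtime.freeMemory()) / Mb

  // Upper limit the JVM will try to use (what -Xmx controls), in megabytes.
  def maxHeapMb(): Long = runtime.maxMemory() / Mb

  def logHeap(stage: String): Unit =
    println(s"[$stage] used = ${usedHeapMb()} MB, max = ${maxHeapMb()} MB")

  def main(args: Array[String]): Unit = {
    logHeap("before")
    val data = (1 to 1000000).toSet // something moderately large to allocate
    logHeap("after building set")
    println(s"set size = ${data.size}")
  }
}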
