
How is off heap memory read/written in Java?

Original 2016-11-03 14:37:00 9 3 java/ scala/ memory/ apache-spark/ heap-memory

In my Spark program, I'm interested in allocating and using data that is not touched by Java's garbage collector. Basically, I want to do the memory management of such data myself, like you would do in C++. Is this a good case for using off heap memory? Secondly, how do you read and write off heap memory in Java or Scala? I tried searching for examples, but couldn't find any.

3 answers

Manual memory management is a viable optimization strategy for garbage collected languages. Garbage collection is a known source of overhead, and algorithms can be tailored to minimize it. For example, when picking a hash table implementation one might prefer Open Addressing, because it allocates its entries manually in the main array instead of handing them over to the language's memory allocator and its GC. As another example, here's a Trie searcher that packs the Trie into a single byte array in order to minimize the GC overhead. Similar optimization can be used for regular expressions.
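To make the Open Addressing point concrete, here is a minimal sketch of an int-to-int map that keeps every entry in two primitive arrays. The class name `OpenAddressingIntMap`, the fixed capacity, the reserved `EMPTY` key and the hash mixing step are all assumptions made for illustration; they do not come from any particular library.

```java
import java.util.Arrays;

// Hypothetical illustration: a fixed-capacity int -> int map using open addressing
// with linear probing. Entries live in two primitive arrays, so there are no
// per-entry objects for the garbage collector to trace.
final class OpenAddressingIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved key marking a free slot
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    OpenAddressingIntMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo < 2 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("key is reserved");
        }
        int slot = mix(key) & mask;
        // Probe forward until we find the key or a free slot.
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        if (keys[slot] == EMPTY) {
            // Always keep one slot free so that probing loops terminate.
            if (size == keys.length - 1) {
                throw new IllegalStateException("map is full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int get(int key, int defaultValue) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return defaultValue;
    }

    // Cheap bit mixing so that sequential keys spread over the table.
    private static int mix(int h) {
        h *= 0x9E3779B9;
        return h ^ (h >>> 16);
    }
}
```

A chained hash map would allocate one node object per entry; here the GC only ever sees two arrays, regardless of how many entries are stored.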

That kind of optimization, where Java arrays are used as low-level storage for the data, goes hand in hand with Data-oriented design, where data is stored in arrays in order to achieve better cache locality. Data-oriented design is widely used in game development, where performance matters.

In JavaScript this kind of array-backed data storage is an important part of asm.js.

The array-backed approach is sufficiently supported by most garbage collectors used in the Java world, as they'll try to avoid moving large arrays around.

If you want to dig deeper, in Linux you can create a file inside the "/dev/shm" filesystem. This filesystem is backed by RAM and won't be flushed to disk unless your operating system is out of memory. Memory-mapping such files (with FileChannel.map) is a good enough way to get off-heap memory directly from the operating system. (MappedByteBuffer operations are JIT-optimized to direct memory access, minus the boundary checks.)
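A minimal sketch of that memory-mapping approach might look like the following. The class name `SharedMemoryDemo`, the file name `offheap-demo.bin` and the 1024-byte size are made up for the example, and the fallback to a temporary file is only there so that the sketch also runs on systems without "/dev/shm".

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class SharedMemoryDemo {
    // Maps the first sizeBytes of the file into memory for reading and writing.
    // The mapping stays valid after the channel is closed.
    static MappedByteBuffer map(Path file, int sizeBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // A READ_WRITE mapping that extends past the end of the file grows the file.
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path shm = Paths.get("/dev/shm");
        // Hypothetical file name; falls back to a temp file when /dev/shm is absent.
        Path file = Files.isDirectory(shm)
                ? shm.resolve("offheap-demo.bin")
                : Files.createTempFile("offheap-demo", ".bin");
        file.toFile().deleteOnExit();

        MappedByteBuffer buffer = map(file, 1024);
        buffer.putLong(0, 42L);      // write a long at byte offset 0
        buffer.putDouble(8, 3.5);    // write a double at byte offset 8
        System.out.println(buffer.getLong(0) + " " + buffer.getDouble(8));
    }
}
```

The absolute `putLong(index, value)` and `getLong(index)` variants are used so that the buffer's position never has to be managed; the contents of the mapping are outside the Java heap and are not scanned by the garbage collector.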

If you want to go even deeper, then you'll have to resort to JNI libraries in order to access the C-level memory allocator, malloc.

If you are not able to achieve "Efficiency with Algorithms, Performance with Data Structures", and if efficiency and performance are that critical, you could consider using "sun.misc.Unsafe". As the name suggests, it is unsafe!

Spark is already using it, as mentioned in project-tungsten.

Also, you can start here to understand it better.
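For orientation, a minimal sketch of reading and writing raw memory through sun.misc.Unsafe could look like this. The class name `UnsafeDemo` and its helper methods are hypothetical. Obtaining the instance through the private `theUnsafe` field is a widely used workaround rather than a supported API, and the compiler will warn that this is an internal class; newer JDKs may also warn at run time.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeDemo {
    // Reads the singleton out of the private static field "theUnsafe" via reflection.
    static Unsafe loadUnsafe() throws ReflectiveOperationException {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        return (Unsafe) field.get(null);
    }

    // Allocates 8 bytes off heap, writes a long, reads it back, and frees the memory.
    static long roundTrip(long value) throws ReflectiveOperationException {
        Unsafe unsafe = loadUnsafe();
        long address = unsafe.allocateMemory(Long.BYTES);
        try {
            unsafe.putLong(address, value);
            return unsafe.getLong(address);
        } finally {
            // Nothing frees this memory automatically; forgetting this call is a leak.
            unsafe.freeMemory(address);
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        System.out.println(roundTrip(123456789L));
    }
}
```

There are no bounds checks here: reading or writing past the allocated range, or after `freeMemory`, can corrupt memory or crash the JVM, which is the C++-style responsibility the question asks about.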

Note: Spark provides a highly concurrent environment for executing applications, and with multiple JVMs most likely spread across multiple machines, manual memory management will be extremely complex. Fundamentally, Spark promotes re-computation over global shared memory. So perhaps you could store partially computed data or results in another store such as HDFS, Kafka or Cassandra.

Have a look at ByteBuffer.allocateDirect(int bytes). You don't need to memory-map files to make use of direct buffers.
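As a rough sketch of how a direct buffer can be read and written, the example below stores fixed-size records (an int id followed by a double score) in off-heap memory. The class name `DirectBufferDemo` and the record layout are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DirectBufferDemo {
    // Hypothetical record layout: 4-byte int id followed by 8-byte double score.
    static final int RECORD_BYTES = Integer.BYTES + Double.BYTES;

    // Allocates off-heap storage for the given number of records.
    static ByteBuffer allocate(int records) {
        return ByteBuffer.allocateDirect(records * RECORD_BYTES)
                .order(ByteOrder.nativeOrder());
    }

    static void write(ByteBuffer buffer, int index, int id, double score) {
        int base = index * RECORD_BYTES;
        buffer.putInt(base, id);                       // absolute write, position untouched
        buffer.putDouble(base + Integer.BYTES, score);
    }

    static int readId(ByteBuffer buffer, int index) {
        return buffer.getInt(index * RECORD_BYTES);
    }

    static double readScore(ByteBuffer buffer, int index) {
        return buffer.getDouble(index * RECORD_BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = allocate(1_000);
        write(buffer, 0, 7, 0.5);
        write(buffer, 999, 8, 1.5);
        System.out.println(buffer.isDirect() + " "
                + readId(buffer, 999) + " " + readScore(buffer, 999));
    }
}
```

Only the small ByteBuffer object lives on the heap. The native memory behind it is released when that object is garbage collected, so there is no explicit free call in this API.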

Off heap can be a good choice if the objects will stick around there for a while (i.e. are reused). If you'll be allocating and deallocating them as you go, that's going to be slower.

Unsafe is cool, but it's going to be removed, probably in Java 9.