
Apache Spark: User Memory vs Spark Memory

I'm building a Spark application where I have to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager introduced in Spark 1.6 here:

https://0x0fff.com/spark-memory-management/

It also shows this picture: [diagram of the memory layout: Reserved, User, and Spark (Storage / Execution) Memory]

The author distinguishes between User Memory and Spark Memory (which is in turn split into Storage and Execution Memory). As I understand it, Spark Memory is flexible between execution (shuffle, sort etc.) and storage (caching): if one side needs more memory, it can borrow from the other (as long as that part is not already fully used). Is this assumption correct?

The User Memory is described like this:

User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using mapPartitions transformation maintaining hash table for this aggregation to run, which would consume so called User Memory. [...] And again, this is the User Memory and its completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause OOM error.
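If I understand that correctly, a hand-rolled aggregation like the following sketch is what would consume User Memory: the HashMap is a plain JVM object that Spark does not track (the counting logic, the ";" separator and the key column are just made-up placeholders, not my real code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import scala.Tuple2;

// Hand-rolled aggregation inside mapPartitions: the HashMap below is an
// ordinary on-heap object, so it lives in User Memory and Spark does not
// account for it. A huge number of distinct keys could cause an OOM.
JavaRDD<String> lines = sc.textFile(inputFile);
JavaRDD<Tuple2<String, Integer>> partialCounts = lines.mapPartitions(
        new FlatMapFunction<Iterator<String>, Tuple2<String, Integer>>() {
            @Override
            public Iterable<Tuple2<String, Integer>> call(Iterator<String> rows) {
                Map<String, Integer> counts = new HashMap<String, Integer>();
                while (rows.hasNext()) {
                    String key = rows.next().split(";")[0]; // assumed key column
                    Integer current = counts.get(key);
                    counts.put(key, current == null ? 1 : current + 1);
                }
                List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    out.add(new Tuple2<String, Integer>(e.getKey(), e.getValue()));
                }
                return out;
            }
        });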

How can I access this part of the memory, and how is it managed by Spark?

And for my purpose, do I just need enough Storage Memory (since I don't do things like shuffle, join etc.)? So, can I set the spark.memory.storageFraction property to 1.0?

The most important question for me is: what about the User Memory? What is it for, especially for the purpose I described above?

Is there a difference in memory usage if I change the program to use my own classes, e.g. RDD<MyOwnRepresentationClass> instead of RDD<String>?

Here is my code snippet (it is called many times from the Livy Client in a benchmark application). I'm using Spark 1.6.2 with Kryo serialization.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;

JavaRDD<String> inputRDD = sc.textFile(inputFile);

// Filter out invalid values
JavaRDD<String> cachedRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String row) throws Exception {
        String[] parts = row.split(";");

        boolean hasFailure = false;
        // Some filtering stuff (omitted) that inspects `parts` and sets hasFailure

        return hasFailure;
    }
}).persist(StorageLevel.MEMORY_ONLY_SER());

Unified memory manager

1) ON HEAP: Objects are allocated on the JVM heap and are subject to GC.

2) OFF HEAP: Objects are serialized and allocated in memory outside the JVM; this memory is managed by the application and is not subject to GC. This approach avoids frequent GC pauses, but the disadvantage is that you have to write the memory allocation and release logic yourself.

ON HEAP:

Storage Memory: mainly used to store Spark cache data, such as RDD cache blocks, broadcast variables, unroll data, and so on.

Execution Memory / shuffle memory: mainly used to store temporary data produced during shuffles, joins, sorts, aggregations, etc.

User Memory: mainly used to store data needed by RDD transformation code, such as RDD dependency information.

Reserved Memory: memory reserved for the system, used to store Spark's internal objects.
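A minimal sketch of how these on-heap pools are sized via configuration (the 4g executor heap is only an example value; 0.75 and 0.5 are the Spark 1.6 defaults):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Spark Memory (storage + execution) = (heap - ~300 MB reserved) * spark.memory.fraction;
// whatever remains after that is User Memory.
SparkConf conf = new SparkConf()
        .setAppName("memory-config-sketch")
        .set("spark.executor.memory", "4g")          // example executor heap size
        .set("spark.memory.fraction", "0.75")        // Spark 1.6 default
        .set("spark.memory.storageFraction", "0.5"); // Spark 1.6 default

JavaSparkContext sc = new JavaSparkContext(conf);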

OFF HEAP MEMORY: 1) Storage Memory 2) Execution Memory (shuffle memory)
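Off-heap memory is disabled by default and has to be enabled and sized explicitly; a minimal sketch (the 2g size is only an example):

import org.apache.spark.SparkConf;

// Off-heap storage/execution memory is allocated outside the JVM heap and is
// therefore not subject to GC; it must be switched on and sized explicitly.
SparkConf offHeapConf = new SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", "2g");     // example size, not a recommendation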

Here is the official documentation; I am not sure whether the statements in the blog are 100% accurate: https://spark.apache.org/docs/latest/tuning.html#memory-management-overview

The "User Memory" is actually called "execution memory". As the name suggests it is used for - computation in shuffles, joins, sorts and aggregations etc. As your code is executed it uses this memory and releases it when it is done. Just imagine JVM's heap space is used for running a Java program. We use this memory implicitly as our program runs. Example - when a file is read into a Dataset, it uses this memory.

The "storage memory" is used when we explicitly cache a dataset using dataset.cache or dataset.persist calls. This memory is release when we un-persist the cache explicitly in the code.

It is not advisable to set spark.memory.storageFraction to 1. Leave it at the default of 0.5. It is important that the application does not crash due to a lack of execution memory. If you don't cache objects, then at worst the application will be slower, but it won't crash. If you need more memory, assign more memory to the executor.
