HashMap over large dataset giving OutOfMemory in Spark

I have a requirement to update a HashMap. In my Spark job I have a JavaPairRDD whose value is a wrapper object holding 9 different HashMaps. Each HashMap has roughly 40-50 crore (400-500 million) keys. While merging two maps (reduceByKey in Spark) I get a Java heap OutOfMemoryError. Below is the code snippet.

 private HashMap<String, Long> getMergedMapNew(HashMap<String, Long> oldMap, 
    HashMap<String, Long> newMap)  {
    for (Entry<String, Long> entry : newMap.entrySet()) {
        try {
            String imei = entry.getKey();
            Long oldTimeStamp = oldMap.get(imei);
            Long newTimeStamp = entry.getValue();

            if (oldTimeStamp != null && newTimeStamp != null) {
                if (oldTimeStamp < newTimeStamp) {
                    oldMap.put(imei, newTimeStamp);
                } else {
                    oldMap.put(imei, oldTimeStamp);
                }

            } else if (oldTimeStamp == null) {
                oldMap.put(imei, newTimeStamp);
            } else if (newTimeStamp == null) {
                oldMap.put(imei, oldTimeStamp);
            }
        } catch (Exception e) {
            logger.error("{}", Utils.getStackTrace(e));
        }
    }
    return oldMap;
}  

This method works on small datasets but fails on large ones. The same method is used for all 9 HashMaps. I looked into increasing heap memory, but I have no idea how to do that in Spark since it runs on a cluster. My cluster is also fairly large (around 300 nodes). Please help me find a solution.

Thanks.

Firstly, I'd focus on three parameters: spark.driver.memory=45g, spark.executor.memory=6g, spark.driver.maxResultSize=8g. Don't take these values for granted; they are simply what works on my setup without OOM errors. Check how much memory is actually available in the Spark UI. You want to give the executors as much memory as you can. By the way, spark.driver.memory is the one that gives the driver more heap space.
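
A minimal sketch of where these settings can go (the app name is a placeholder and the sizes are examples only; tune them to what your cluster actually has):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Example values only -- check available memory per node in the Spark UI first.
    SparkConf conf = new SparkConf()
            .setAppName("imei-timestamp-merge")          // placeholder app name
            .set("spark.executor.memory", "6g")          // heap per executor
            .set("spark.driver.maxResultSize", "8g");    // cap on data collected back to the driver

    // spark.driver.memory cannot be raised from inside the application in client mode,
    // because the driver JVM is already running; supply it at submission time instead:
    //   spark-submit --driver-memory 45g ...
    JavaSparkContext sc = new JavaSparkContext(conf);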

As far as I can see, this code is executed on the Spark driver. I would recommend converting those two HashMaps to DataFrames with two columns, imei and timestamp. Then join both with an outer join on imei and select the appropriate timestamps using when. This code will be executed on the workers, it will be parallelized, and consequently you won't run into these memory problems. If you really plan on doing this on the driver, then follow the instructions given by Jarek and increase spark.driver.memory.
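
A minimal sketch of that idea, assuming both inputs are already DataFrames with columns imei and timestamp (how you load them is up to you; building them from huge in-memory HashMaps on the driver would defeat the purpose):

    import static org.apache.spark.sql.functions.coalesce;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.when;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    private Dataset<Row> mergeLatestTimestamps(Dataset<Row> oldDf, Dataset<Row> newDf) {
        // Rename so the two timestamp columns stay distinguishable after the join.
        Dataset<Row> o = oldDf.withColumnRenamed("timestamp", "old_ts");
        Dataset<Row> n = newDf.withColumnRenamed("timestamp", "new_ts");

        // Full outer join keeps IMEIs that appear on only one side.
        Dataset<Row> joined = o.join(n, o.col("imei").equalTo(n.col("imei")), "full_outer");

        // Keep the newer timestamp when both exist, otherwise whichever one is present.
        Column merged = when(col("old_ts").isNotNull().and(col("new_ts").isNotNull()),
                    when(col("new_ts").gt(col("old_ts")), col("new_ts")).otherwise(col("old_ts")))
                .otherwise(coalesce(col("old_ts"), col("new_ts")));

        return joined.select(
                coalesce(o.col("imei"), n.col("imei")).alias("imei"),
                merged.alias("timestamp"));
    }

functions.greatest, which skips nulls, would express the same choice more compactly; the when/otherwise form above just mirrors the null handling in your original method.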
