
Fastest and Most Effective Way to Iterate a Large Dataset in Java Spark

I am converting a Spark Dataset into a list of hash maps using the approach below. My end goal is to build either a list of JSON objects or a list of hash maps. I am running this code on 3.2 million rows.

List<HashMap> finalJsonMap = new ArrayList<HashMap>();

srcData.foreachPartition(new ForeachPartitionFunction<Row>() {
    public void call(Iterator<Row> t) throws Exception {
        while (t.hasNext()) {
            Row eachRow = t.next();
            HashMap rowMap = new HashMap();
            for (int j = 0; j < grpdColNames.size(); j++) {
                rowMap.put(grpdColNames.get(j), eachRow.getString(j));
            }
            finalJsonMap.add(rowMap);
        }
    }
});

The iteration itself works fine, but I am unable to add rowMap to finalJsonMap.

What is the best approach to do this?

That's really not how Spark works.

The code that is put in foreachPartition is executed in a different context (on the executors) than the original

List<HashMap> finalJsonMap = new ArrayList<HashMap>();

All you can do in such a setup is modify the executor's local copy of the list; the changes never reach the driver.

This has been discussed multiple times on Stack Overflow and is described in detail in the official documentation in the Understanding Closures section.
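
In the spirit of the counter example in that documentation section, here is a minimal sketch of what actually happens, using the question's own srcData (the seen list is just illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

// ...

List<Integer> seen = new ArrayList<>();                       // lives on the driver
srcData.foreach((ForeachFunction<Row>) row -> seen.add(1));   // each task mutates its own deserialized copy
System.out.println(seen.size());                              // 0 on a cluster, not 3.2 million; in local mode the behavior is not guaranteed either way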

Considering the required result (i.e. a local collection), there is really nothing you can do other than convert your code to use mapPartitions and collect. That is, however, hardly efficient or idiomatic in Spark.
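
If you do go that route, a minimal sketch could look like the one below (assuming, as in the question, that srcData is a Dataset<Row> and grpdColNames is a List<String> of column names; the list has to be serializable and effectively final so the closure can be shipped to the executors):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// ...

List<HashMap> finalJsonMap = srcData
    .mapPartitions((MapPartitionsFunction<Row, HashMap>) rows -> {
        // Build the maps for this partition on the executor.
        List<HashMap> maps = new ArrayList<>();
        while (rows.hasNext()) {
            Row row = rows.next();
            HashMap<String, String> rowMap = new HashMap<>();
            for (int j = 0; j < grpdColNames.size(); j++) {
                rowMap.put(grpdColNames.get(j), row.getString(j));
            }
            maps.add(rowMap);
        }
        return maps.iterator();
    }, Encoders.kryo(HashMap.class))
    // collectAsList() ships every map back to the driver; with 3.2 million rows
    // this is exactly where the memory and time go.
    .collectAsList();

The Encoders.kryo(HashMap.class) encoder is just one way to tell Spark how to serialize the intermediate maps, and the raw HashMap types mirror the original code.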

I'd strongly recommend rethinking your current design.
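
For example, if the end goal really is JSON rather than a driver-side Java collection, one more idiomatic direction is to let Spark produce the JSON itself (the output path below is just a placeholder):

import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// ...

// Keep only the grouped columns and let Spark render each row as a JSON string.
List<String> jsonRows = srcData
    .select(grpdColNames.stream().map(functions::col).toArray(Column[]::new))
    .toJSON()              // Dataset<String>, one JSON document per row
    .collectAsList();      // still pulls everything to the driver; only do this if it fits

// Or skip the driver entirely and write JSON files directly:
srcData.write().json("/path/to/output");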
