Spark: How to merge the transformations

I have 1000 JSON files. I need to apply some transformations to each file and then create a single merged output file. The merge involves operations that span files; for example, the output should not contain repeated values.

If I read the files with wholeTextFiles as (filename, content) pairs, and in the map function I parse the content as a JSON tree and perform the transformation, where and how do I merge the output?

Do I need another transformation on the resulting RDD to merge the values, and how would that work? Or can I have a shared object (a List, a Map, or an RDD?) among all map tasks that is updated as part of the transformation, so that I can check for repeated values there?

PS: Even if the output is written as part files, I would still like it to contain no repetitions.

Code:

//read the files as JavaPairRDD , which gives <filename, content> pairs
String filename = "/sample_jsons";
JavaPairRDD<String,String> distFile = sc.wholeTextFiles(filename);

//then create a JavaRDD from the content.
JavaRDD<String> jsonContent = distFile.map(x -> x._2);

//apply transformations, the map function will return an ArrayList which would
//have property names.

JavaRDD<ArrayList<String>> apm = jsonContent.map(
    new Function<String, ArrayList<String>>() {
        @Override
        public ArrayList<String> call(String arg0) throws Exception {
            JsonNode rootNode = mapper.readTree(arg0);
            return parseJsonAndFindKey(rootNode, "type", "rootParent");
        }
    });

This way I get all first-level properties of each JSON file in an ArrayList.

Now I need a final ArrayList that is the union of all these ArrayLists, with duplicates removed. How can I achieve that?

Why do you need 1000 RDDs for 1000 JSON files?

Do you see any issue with merging the 1000 JSON files into one RDD at the input stage?

If you use a single RDD from the input stage, it shouldn't be hard to perform all the needed operations on that RDD.
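
As a rough illustration of the merge step on a single collection, here is a minimal sketch in plain Java. It uses java.util.stream as a local stand-in for the RDD so that it runs without a Spark installation; the class name MergeSketch, the method mergeDistinct, and the sample property names are hypothetical. On the actual RDD from the question, the corresponding operations would presumably be flatMap on apm (to turn the per-file lists into individual strings), followed by distinct() and collect().

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question.
public class MergeSketch {

    // Flattens the per-file lists into one sequence and removes duplicates.
    // Assumed Spark analogue: apm.flatMap(...).distinct().collect().
    // Note: the stream version keeps first-seen order; Spark's distinct()
    // makes no ordering guarantee.
    static List<String> mergeDistinct(List<List<String>> perFileKeys) {
        return perFileKeys.stream()
                .flatMap(List::stream)   // one element per property name
                .distinct()              // drop repeated values
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the output of the map step.
        List<List<String>> perFile = Arrays.asList(
                Arrays.asList("name", "type"),
                Arrays.asList("type", "id"),
                Arrays.asList("id", "name"));

        System.out.println(mergeDistinct(perFile));
    }
}
```

The point of the sketch is that deduplication happens as a separate step over the flattened data, rather than through a shared mutable object updated inside the map function.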
