
How to have Nested Map RDDs in Spark

I have a text file like:-

ID,Hour,Ratio
100775,0.0,1.0
100775,1.0,1.0560344797302321
100775,2.0,1.1333317975785973
100775,3.0,1.1886133302168074
100776,4.0,1.2824427440125867

I want a structure like Map<Hour, Map<ID, Ratio>>, stored as an RDD. The closest structure I could find was JavaPairRDD. I tried implementing JavaPairRDD<Hour, Map<ID, Ratio>>; however, this structure's lookup() functionality returns a List<Map<ID, Ratio>>, which does not solve my use-case, as I essentially want to do

ratio = map.get(Hour).get(ID)

Any pointers on how best to get this done?

UPDATE :-

After Ramesh's answer, I tried the following:-

JavaRDD<Map<String, Map<String, String>>> mapRDD = data
        .map(line -> line.split(","))
        .map(array -> Collections.singletonMap(array[1],
                Collections.singletonMap(array[0], array[2])));

However, there is no lookup()-like functionality available here, correct?
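A plain JavaRDD indeed has no lookup(). One workaround, sketched below under the assumption that the whole dataset is small enough to collect to the driver, is to gather the singleton maps and merge them into a single nested map that supports get(Hour).get(ID):

import java.util.HashMap;
import java.util.Map;

// Collect the singleton maps and merge them into one nested Map<Hour, Map<ID, Ratio>> on the driver.
Map<String, Map<String, String>> nested = new HashMap<>();
for (Map<String, Map<String, String>> singleton : mapRDD.collect()) {
    singleton.forEach((hour, idToRatio) ->
            nested.computeIfAbsent(hour, h -> new HashMap<>()).putAll(idToRatio));
}

String ratio = nested.get("2.0").get("100775"); // "1.1333317975785973"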

Here's what you can do

scala> val rdd = sc.textFile("path to the csv file")
rdd: org.apache.spark.rdd.RDD[String] = path to csv file MapPartitionsRDD[7] at textFile at <console>:24

scala> val maps = rdd.map(line => line.split(",")).map(array => (array(1), Map(array(0) -> array(2)))).collectAsMap()
maps: scala.collection.Map[String,scala.collection.immutable.Map[String,String]] = Map(1.0 -> Map(100775 -> 1.0560344797302321), 4.0 -> Map(100776 -> 1.2824427440125867), 0.0 -> Map(100775 -> 1.0), 3.0 -> Map(100775 -> 1.1886133302168074), 2.0 -> Map(100775 -> 1.1333317975785973))

If you require RDD[Map[String, Map[String, String]]] then you can do the following.

scala> val rddMaps = rdd.map(line => line.split(",")).map(array => Map(array(1) -> Map(array(0) -> array(2)))).collect
rddMaps: Array[scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,String]]] = Array(Map(0.0 -> Map(100775 -> 1.0)), Map(1.0 -> Map(100775 -> 1.0560344797302321)), Map(2.0 -> Map(100775 -> 1.1333317975785973)), Map(3.0 -> Map(100775 -> 1.1886133302168074)), Map(4.0 -> Map(100776 -> 1.2824427440125867)))

I hope the answer is helpful.
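For the Java API used in the question, a rough equivalent of the collectAsMap approach could look like the sketch below (sc is assumed to be a JavaSparkContext; note that because each value is a singleton map, a later row for the same hour would overwrite an earlier one, so multiple IDs within one hour would need a merge step such as reduceByKey first):

import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import java.util.Collections;
import java.util.Map;

JavaRDD<String> lines = sc.textFile("path to the csv file");

// (Hour, Map(ID -> Ratio)) pairs, collected into a driver-side Map keyed by hour.
Map<String, Map<String, String>> byHour = lines
        .mapToPair(line -> {
            String[] a = line.split(",");
            return new Tuple2<>(a[1], Collections.singletonMap(a[0], a[2]));
        })
        .collectAsMap();

String ratio = byHour.get("3.0").get("100775"); // "1.1886133302168074"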

For my use-case, I have decided to go with the following:-

I created a JavaPairRDD<Hour, Map<ID, Ratio>>. At any time while the task is running, I only require the map corresponding to that hour. So I did the following:-

Map<String, Double> result = new HashMap<>();
javaRDDPair.lookup(HOUR).stream().forEach(map ->
        result.putAll(map.entrySet().stream()
                .collect(Collectors.toMap(entry -> entry.getKey(), entry -> entry.getValue()))));

This could now be further used as a broadcast variable.
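A sketch of that broadcast step, assuming a JavaSparkContext named sc (variable names are placeholders):

import org.apache.spark.broadcast.Broadcast;
import java.util.Map;

// Ship the per-hour map to every executor once instead of serializing it with each task closure.
Broadcast<Map<String, Double>> hourRatios = sc.broadcast(result);

// Inside any transformation running on the executors:
// Double ratio = hourRatios.value().get(id);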

Working with a dataset like this is a common problem in Spark: the dataset contains one sample per row, and each column represents a feature of that sample. A common solution is to define an Entity class whose properties correspond to the columns, so that each sample becomes one object in the RDD. These objects can then be accessed through a JavaPairRDD keyed by whichever field you need, e.g. HOUR in this example, giving something like:

   JavaPairRDD<Integer, Entity>
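A minimal sketch of that idea (the Entity class, its field names, and sc are illustrative assumptions, not from the original post):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import java.io.Serializable;

// One row of the file as a typed object.
class Entity implements Serializable {
    public final String id;
    public final int hour;
    public final double ratio;

    public Entity(String id, int hour, double ratio) {
        this.id = id;
        this.hour = hour;
        this.ratio = ratio;
    }
}

// Build a pair RDD keyed by hour; lookup(hour) then returns the entities for that hour.
JavaRDD<String> lines = sc.textFile("path to the csv file");
JavaPairRDD<Integer, Entity> byHour = lines.mapToPair(line -> {
    String[] a = line.split(",");
    Entity e = new Entity(a[0], (int) Double.parseDouble(a[1]), Double.parseDouble(a[2]));
    return new Tuple2<>(e.hour, e);
});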
