
How to have Nested Map RDD's in Spark

I have a text file like:

ID,Hour,Ratio
100775,0.0,1.0
100775,1.0,1.0560344797302321
100775,2.0,1.1333317975785973
100775,3.0,1.1886133302168074
100776,4.0,1.2824427440125867

I want a structure like MAP{Hour,MAP{ID,Ratio}} to be stored as an RDD. The closest structure I could find was JavaPairRDD. I tried implementing a structure like JavaPairRDD{Hour,MAP{ID,Ratio}}; however, that structure's lookup() functionality returns LIST{MAP{ID,RATIO}}, which does not solve my use case, as I essentially want to do

ratio = MAP.get(Hour).get(ID)

Any pointers on how best to get this done?
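For reference, a minimal plain-Java sketch (no Spark involved) of the nested shape and lookup I am after; the map types and literal values here are just illustrative, taken from the sample above:

import java.util.HashMap;
import java.util.Map;

// Target shape: MAP{Hour, MAP{ID, Ratio}}
Map<Double, Map<Long, Double>> ratios = new HashMap<>();
ratios.computeIfAbsent(0.0, h -> new HashMap<>()).put(100775L, 1.0);
ratios.computeIfAbsent(1.0, h -> new HashMap<>()).put(100775L, 1.0560344797302321);

// Desired lookup: ratio = MAP.get(Hour).get(ID)
double ratio = ratios.get(0.0).get(100775L);   // 1.0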

UPDATE:

After Ramesh's answer, I tried the following:

JavaRDD<Map<String, Map<String, String>>> mapRDD = data
        .map(line -> line.split(","))
        .map(array -> Collections.singletonMap(array[0],
                Collections.singletonMap(array[1], array[2])));

However, there is no lookup()-like functionality available here, correct?

Here's what you can do:

scala> val rdd = sc.textFile("path to the csv file")
rdd: org.apache.spark.rdd.RDD[String] = path to csv file MapPartitionsRDD[7] at textFile at <console>:24

scala> val maps = rdd.map(line => line.split(",")).map(array => (array(1), Map(array(0) -> array(2)))).collectAsMap()
maps: scala.collection.Map[String,scala.collection.immutable.Map[String,String]] = Map(1.0 -> Map(100775 -> 1.0560344797302321), 4.0 -> Map(100776 -> 1.2824427440125867), 0.0 -> Map(100775 -> 1.0), 3.0 -> Map(100775 -> 1.1886133302168074), 2.0 -> Map(100775 -> 1.1333317975785973))

If you require RDD[Map[String, Map[String, String]]], then you can do the following.

scala> val rddMaps = rdd.map(line => line.split(",")).map(array => Map(array(1) -> Map(array(0) -> array(2)))).collect
rddMaps: Array[scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,String]]] = Array(Map(0.0 -> Map(100775 -> 1.0)), Map(1.0 -> Map(100775 -> 1.0560344797302321)), Map(2.0 -> Map(100775 -> 1.1333317975785973)), Map(3.0 -> Map(100775 -> 1.1886133302168074)), Map(4.0 -> Map(100776 -> 1.2824427440125867)))

I hope the answer is helpful.
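Since the question uses the Java API, here is a rough Java equivalent of the collectAsMap approach above, as a sketch only: the file path, the local master setting, and the header filter are placeholders. Note too that collectAsMap keeps a single value per key, so if the same hour appears for several IDs only one of them survives; a merge step (e.g. reduceByKey) would be needed first to keep them all.

import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class NestedMapLookup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("nested-map").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // "ratios.csv" is a placeholder path; drop the header line first
            JavaRDD<String> data = sc.textFile("ratios.csv")
                                     .filter(line -> !line.startsWith("ID,"));

            // Pair each row by Hour, then collect to the driver as Map{Hour -> Map{ID -> Ratio}}
            Map<String, Map<String, String>> byHour = data
                .map(line -> line.split(","))
                .mapToPair(a -> new Tuple2<>(a[1], Collections.singletonMap(a[0], a[2])))
                .collectAsMap();

            System.out.println(byHour.get("0.0").get("100775"));   // prints 1.0
        }
    }
}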

For my use-case, I have decided to go with the following:

I created a JavaPairRDD{Hour,MAP{ID,Ratio}}. At any time a task is running, I only require the map corresponding to that hour. So I did the following:

Map<String, Double> result = new HashMap<>();
javaRDDPair.lookup(HOUR).stream().forEach(map -> {
    result.putAll(map.entrySet().stream()
            .collect(Collectors.toMap(entry -> entry.getKey(), entry -> entry.getValue())));
});

This could now be further used as a broadcast variable.
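For completeness, a rough sketch of wrapping that per-hour map in a broadcast variable; jsc (the JavaSparkContext) and events (some JavaPairRDD of ID-keyed values) are assumed names for illustration, not part of the original code:

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// 'result' is the Map<String, Double> built from javaRDDPair.lookup(HOUR) above;
// 'jsc' is the JavaSparkContext already in use.
Broadcast<Map<String, Double>> hourRatios = jsc.broadcast(result);

// Illustrative use inside a transformation: scale each (id, value) pair by its ratio,
// falling back to 1.0 when the ID has no ratio for this hour.
// 'events' is an assumed JavaPairRDD<String, Double> keyed by ID.
JavaRDD<Double> scaled = events.map(pair ->
        pair._2() * hourRatios.value().getOrDefault(pair._1(), 1.0));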

Working with a dataset like this is a common problem in Spark. Typically the dataset contains one sample per row, and each column represents a feature of that sample. A common solution is to define an Entity class whose properties are the columns, so that each sample becomes one object in the RDD. These objects can then be accessed through a JavaPairRDD keyed by, in this example, HOUR, giving something like:

   JavaPairRDD<Integer, Entity>
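A minimal sketch of what that Entity-based layout could look like; the class name, field names, and the byHour helper are all illustrative rather than anything from the original answer:

import java.io.Serializable;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// One CSV row as an object; Spark requires it to be Serializable.
class Entity implements Serializable {
    final int id;
    final int hour;
    final double ratio;

    Entity(int id, int hour, double ratio) {
        this.id = id;
        this.hour = hour;
        this.ratio = ratio;
    }
}

class EntityExample {
    // Key each Entity by its hour, giving JavaPairRDD<Integer, Entity>.
    // Hours in the sample are whole numbers ("0.0", "1.0", ...), so they are
    // parsed as doubles and truncated to int here.
    static JavaPairRDD<Integer, Entity> byHour(JavaRDD<String> lines) {
        return lines
            .map(line -> line.split(","))
            .mapToPair(a -> {
                Entity e = new Entity(Integer.parseInt(a[0]),
                                      (int) Double.parseDouble(a[1]),
                                      Double.parseDouble(a[2]));
                return new Tuple2<>(e.hour, e);
            });
    }
}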
