
spark - how to look up the keys and values of a (Java)PairRDD inside another RDD's transformation

I have a PairRDD rdd1 with Integer keys and Integer[] values.

I also have another PairRDD rdd2 with Integer keys and Double values.

Every Integer that appears in rdd1, whether as a key or inside a value array, also exists in rdd2 as a key.

For each pair (x, [y1, y2, ..., yn]) in rdd1, I want to get the Double value of x and the Double values of each Integer y1, y2, ..., yn.

I tried collecting rdd2 as a Map<Integer,Double> (map2), but it does not fit in memory and I get OOM errors. I also tried joining the RDDs, but I could not figure out how to join on both the keys and the values. Calling rdd2's lookup() method inside a transformation of rdd1 is not allowed.

The pseudocode of what I want is the following:

map each (int x, int[] y) in rdd1 to:
      (x, map2.get(x) + sum(map2.get(yi)))

for each yi in y.
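
To pin down the expected result, here is a minimal plain-Java sketch of that pseudocode on in-memory data (no Spark involved). The class name `LookupSketch`, the helper `combine`, and the sample values are made up for illustration; this is exactly the map-based approach that runs out of memory at scale, so it only shows what the output should be.

```java
import java.util.HashMap;
import java.util.Map;

class LookupSketch {
    // Hypothetical helper: the value of x plus the values of every yi, all read from map2.
    static double combine(int x, int[] ys, Map<Integer, Double> map2) {
        double total = map2.get(x);
        for (int y : ys) {
            total += map2.get(y);
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample data (assumed): stands in for the collected rdd2.
        Map<Integer, Double> map2 = new HashMap<>();
        map2.put(1, 0.5);
        map2.put(2, 1.5);
        map2.put(3, 2.0);

        // The rdd1 pair (1, [2, 3]) should map to (1, 0.5 + 1.5 + 2.0).
        System.out.println(combine(1, new int[] {2, 3}, map2)); // prints 4.0
    }
}
```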

I use Java, but I guess the same problem holds in both Java and Scala.

Depending on what you want to do with missing matches (cases where there's an index in rdd1 and no corresponding index in rdd2), the query looks something like the following.

rdd1.
    // ( x, [ y1, ..., yn ] ) -> ( x, x ), ( y1, x ), ..., ( yn, x )
    // `+:` prepends to an Array (or any Seq); `::` would only compile if ys were a List
    flatMap { case ( x, ys ) => ( x +: ys ).map( ( _, x ) ) }.
    // ( xory, x ) -> ( xory, ( x, rdd2.lookup( xory ) ) )
    leftOuterJoin( rdd2 ).
    // ( xory, ( x, rdd2.lookup( xory ) ) ) -> ( x, rdd2.lookup( xory ) )
    map( _._2 ).
    // ( x, rdd2.lookup( x ) ), ... -> ( x, rdd2.lookup( x ) + sum_i( rdd2.lookup( y_i ) ) )
    reduceByKey{ case ( dopt1, dopt2 ) => ( dopt1 ++ dopt2 ).reduceOption( _ + _ ) }.
    // unwrap the option types; an x with no match at all in rdd2 ends up as 0.0
    mapValues( _.getOrElse( 0.0 ) )
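
Since the question uses Java, the same sequence of steps can be traced in plain Java on in-memory maps. This is only a sketch of the logic, not Spark code: `JoinSketch`, `sumByJoin`, and the sample data are hypothetical, and the two maps merely stand in for rdd1 and rdd2. In Spark's Java API the corresponding steps would presumably be `flatMapToPair`, `leftOuterJoin`, `mapToPair`, and `reduceByKey` on a `JavaPairRDD`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical stand-in for the pipeline; rdd1 and rdd2 are plain maps here.
    static Map<Integer, Double> sumByJoin(Map<Integer, int[]> rdd1, Map<Integer, Double> rdd2) {
        // Step 1 (flatMap): emit an (id, x) record for x itself and for every yi.
        List<int[]> idToOwner = new ArrayList<>();
        for (Map.Entry<Integer, int[]> e : rdd1.entrySet()) {
            int x = e.getKey();
            idToOwner.add(new int[] {x, x});
            for (int y : e.getValue()) {
                idToOwner.add(new int[] {y, x});
            }
        }
        // Steps 2-4 (left outer join, re-key by x, reduce by key): look up each id
        // and add its value to x's running total; a missing id contributes 0.0.
        Map<Integer, Double> totals = new HashMap<>();
        for (int[] record : idToOwner) {
            double value = rdd2.getOrDefault(record[0], 0.0);
            totals.merge(record[1], value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Sample data (assumed); id 5 is deliberately missing from rdd2.
        Map<Integer, int[]> rdd1 = new HashMap<>();
        rdd1.put(1, new int[] {2, 3});
        rdd1.put(3, new int[] {4, 5});
        Map<Integer, Double> rdd2 = new HashMap<>();
        rdd2.put(1, 0.5);
        rdd2.put(2, 1.5);
        rdd2.put(3, 2.0);
        rdd2.put(4, 1.0);

        Map<Integer, Double> totals = sumByJoin(rdd1, rdd2);
        System.out.println(totals.get(1)); // 0.5 + 1.5 + 2.0
        System.out.println(totals.get(3)); // 2.0 + 1.0 + 0.0 (id 5 has no match)
    }
}
```

Treating a missing id as 0.0 mirrors the `getOrElse( 0.0 )` choice above; if missing matches should instead drop the whole record, an inner join would be the natural replacement for the left outer join.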
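
How about this?

import java.util.HashMap;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import static java.util.Arrays.asList;

// class name is arbitrary
public class FlattenKeysAndValues {
    public static void main(String[] args) {
        HashMap<Integer, List<Integer>> map = new HashMap<>();
        map.put(1, asList(2, 3));
        map.put(3, asList(4, 5));

        System.out.println(
                map.entrySet().stream()
                        // emit each key followed by all of its values, converted to Double
                        .flatMap(kv ->
                                Stream.concat(
                                        Stream.of(kv.getKey().doubleValue()),
                                        kv.getValue().stream().map(Integer::doubleValue)))
                        .collect(Collectors.toList())
        );
    }
}

This should give you all (keys and values) in one RDD (shown here on a plain Java stream rather than an RDD), which you can use as keys in your second RDD. You can of course change the type.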


 