
How to compare two pair RDDs

I have two pair RDDs, r1 and r2, containing tuples defined as

Tuple2<Integer,String[]> 

What I want to do is find the tuples from both RDDs that have the same key, then compare every element of the value part (String[]) from r1 with the corresponding element from r2, and return the indexes of the elements that are different. As an example, let's suppose that r1 is like:

{ (1,["a1","b1","c1"]) (2,["x1","y1","z1"])...}

and r2 is like:

{ (1,["a2","b2","c2"]) (3,["x2","y2","z2"])...}

As we can see, the key 1 exists in both RDDs, so it is relevant. Now I want to go through the value part in both RDDs and compare the elements one by one with the elements that have the same index in the other RDD, and when I find a pair of elements (with the same index in the tuple from r1 and the tuple from r2) that differ, I return the value of that index. Let me explain.

This is the tuple with key 1 in r1:

  (1,["a1","b1","c1"])

And this is the tuple with key 1 in r2:

(1,["a2","b2","c2"])

By sweeping, I compare "a1" with "a2", "b1" with "b2", and "c1" with "c2".

Suppose that after the comparison I found:

"a1".equals("a2") = true, "b1".equals("b2") = false, and "c1".equals("c2") = false

Knowing that array indexes in Java start at 0, and since, as I said before, I want to return the indexes of the elements that are not equal, in this example I would return index1=1 and index2=2. How can I do this?

Note: if I have to return more than one index, I think it would be better to collect them in one RDD of Integers named

  JavaRDD <Integer> indexes

I hope that this is clear, and I would appreciate any help. Thank you.

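You could do this with join followed by mapToPair (in the Spark Java API, map on a pair RDD returns a plain JavaRDD, so mapToPair is needed to get a JavaPairRDD back).

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// join keeps only the keys that exist in both r1 and r2
JavaPairRDD<Integer, Integer[]> idWithIndexes = r1.join(r2).mapToPair(
        new PairFunction<Tuple2<Integer, Tuple2<String[], String[]>>, Integer, Integer[]>() {
    @Override
    public Tuple2<Integer, Integer[]> call(Tuple2<Integer, Tuple2<String[], String[]>> t) throws Exception {
        int id = t._1();
        String[] s1 = t._2()._1();
        String[] s2 = t._2()._2();
        // compare only the positions present in both arrays
        int length = Math.min(s1.length, s2.length);

        // collect every index at which the two values differ
        List<Integer> index = new ArrayList<Integer>();
        for (int i = 0; i < length; i++) {
            if (!s1[i].equals(s2[i])) {
                index.add(i);
            }
        }

        return new Tuple2<Integer, Integer[]>(id, index.toArray(new Integer[0]));
    }
});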

This returns a JavaPairRDD that maps each id present in both RDDs to the array of indexes at which the two values differ.
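
The comparison loop inside call does not depend on Spark, so it can be tried on its own first. Below is a minimal sketch in plain Java; the class name DiffIndexes, the method name differingIndexes, and the sample values are made up for illustration and are not part of the answer or of Spark. It assumes that neither array contains null elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer or of Spark.
class DiffIndexes {

    // Returns the indexes at which a and b hold different strings.
    // Only positions present in both arrays are compared.
    static List<Integer> differingIndexes(String[] a, String[] b) {
        int length = Math.min(a.length, b.length);
        List<Integer> indexes = new ArrayList<Integer>();
        for (int i = 0; i < length; i++) {
            if (!a[i].equals(b[i])) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Made-up values: index 0 matches, indexes 1 and 2 differ.
        String[] fromR1 = {"a", "b1", "c1"};
        String[] fromR2 = {"a", "b2", "c2"};
        System.out.println(differingIndexes(fromR1, fromR2)); // prints [1, 2]
    }
}
```

Because the loop stops at the shorter length, extra trailing elements in the longer array are not reported as differences. If they should count, that would need an additional check.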

