
Filtering JavaPairRDD based on a JavaRDD in Spark

I'm very new in Apache Spark. I need a Java solution for the problem below:

JavaPairRDD:        JavaRDD:           Desired Output:

1,USA               France             2,England
2,England           England            3,France
3,France
4,Italy 

Edit: Frankly, I have no idea what to try. Like I said, I'm a complete newbie at Spark. I thought I could use a method like intersection, but that requires another JavaPairRDD object. I don't think the filter method alone will solve this problem. For example,

Function<Tuple2<String, String>, Boolean> myFilter =
  new Function<Tuple2<String, String>, Boolean>() {
    public Boolean call(Tuple2<String, String> keyValue)
      {
        return ("some boolean expression");
      }
    };
myPairRDD.filter(myFilter);

I have no idea what kind of boolean expression to write in place of "some boolean expression" in the function above. Sorry for my English, by the way.

There are at least three options:

  • map the JavaRDD to a JavaPairRDD with an arbitrary value, join, and map to drop the dummy values
  • if the number of unique values in the JavaRDD is small, collect the distinct values, convert them to a Set, broadcast it, and use it to filter the JavaPairRDD
  • convert both RDDs to DataFrames and use an inner join followed by drop / select.
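The second option also answers the original question directly: the broadcast set supplies the boolean expression for filter. A minimal sketch, assuming a JavaPairRDD<Integer, String> of (id, country) pairs and a JavaRDD<String> of wanted countries (the names pairs, countries, and keepMatching are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastFilter {
    public static JavaPairRDD<Integer, String> keepMatching(
            JavaSparkContext sc,
            JavaPairRDD<Integer, String> pairs,
            JavaRDD<String> countries) {
        // Collect the (assumed small) set of wanted values on the driver.
        Set<String> wanted = new HashSet<>(countries.distinct().collect());
        // Ship the set to every executor once.
        Broadcast<Set<String>> broadcastWanted = sc.broadcast(wanted);
        // The set lookup is the "boolean expression" inside filter:
        // keep a pair only if its value appears in the broadcast set.
        return pairs.filter(kv -> broadcastWanted.value().contains(kv._2()));
    }
}
```

With the sample data above, wanted would be {France, England}, so (2, England) and (3, France) pass the filter while (1, USA) and (4, Italy) are dropped. This avoids a shuffle, but only works if the distinct values fit comfortably in driver and executor memory; otherwise prefer the join-based options.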
