Apache Spark: JOINing RDDs (data sets) using custom criteria/fuzzy matching

Is it possible to join two (Pair)RDDs (or Datasets/DataFrames) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates, and various "distance methods", e.g. Levenshtein, for strings?

For "grouping" within an RDD to get a PairRDD , one can implement a PairFunction , but it seems that something similar is not possible when JOINing two RDD s/data sets? I am thinking something like:

rdd1.join(rdd2, myCustomJoinFunction);

I was thinking about implementing the custom logic in hashCode() and equals() but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into RDD.cogroup() but have not figured out how I could use it to implement this.

I just came across elasticsearch-hadoop. Does anyone know if that library could be used to do something like this?

I am using Apache Spark 2.0.0. I am implementing in Java but an answer in Scala would also be very helpful.

P.S. This is my first Stack Overflow question, so bear with me if I have made some newbie mistakes :).

For DataFrames/Datasets you can use join with a custom join condition. Create a UDF that uses columns from the DataFrames, just like in the first answer to this question.
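
A minimal Java sketch of that idea (the column names "name" and "amount", the input paths, the UDF name and the tolerances are all made up for illustration; Spark also has a built-in levenshtein column function that can be used directly in the join condition):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.levenshtein;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class FuzzyJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fuzzy-join").master("local[*]").getOrCreate();

        // Hypothetical inputs, each with a "name" (string) and an "amount" (double) column.
        Dataset<Row> df1 = spark.read().json("orders1.json");
        Dataset<Row> df2 = spark.read().json("orders2.json");

        // UDF expressing a custom "close enough" criterion on a numeric column.
        spark.udf().register("withinTolerance",
                (UDF2<Double, Double, Boolean>) (a, b) -> Math.abs(a - b) <= 5.0,
                DataTypes.BooleanType);

        // Join on a combination of fuzzy criteria:
        //  - names within Levenshtein distance 2 (built-in column function)
        //  - amounts within an absolute tolerance (the UDF above)
        // There are no equi-join keys, so Spark falls back to a nested-loop style join
        // that effectively compares every pair of rows.
        Dataset<Row> joined = df1.join(df2,
                levenshtein(df1.col("name"), df2.col("name")).leq(2)
                        .and(callUDF("withinTolerance", df1.col("amount"), df2.col("amount"))));

        joined.show();
    }
}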

You can also do

rdd1.cartesian(rdd2).filter(...)

Keep in mind that this computes every pair of records, so it can take a long time.
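
A rough Java sketch of the cartesian-plus-filter approach on plain RDDs (the sample data and the numeric tolerance are invented; any matching rule, e.g. a Levenshtein distance on strings, can go into the filter):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class CartesianFuzzyJoin {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cartesian-fuzzy-join").setMaster("local[*]"));

        // Hypothetical sample data: (id, measurement) records from two sources.
        JavaRDD<Tuple2<String, Integer>> rdd1 =
                sc.parallelize(Arrays.asList(new Tuple2<>("a", 100), new Tuple2<>("b", 205)));
        JavaRDD<Tuple2<String, Integer>> rdd2 =
                sc.parallelize(Arrays.asList(new Tuple2<>("x", 102), new Tuple2<>("y", 400)));

        // Build every (left, right) pair, then keep only pairs whose measurements
        // differ by at most 5. The cartesian product is O(n*m), hence the cost warning above.
        JavaPairRDD<Tuple2<String, Integer>, Tuple2<String, Integer>> matches =
                rdd1.cartesian(rdd2)
                    .filter(pair -> Math.abs(pair._1()._2() - pair._2()._2()) <= 5);

        matches.collect().forEach(System.out::println);

        sc.stop();
    }
}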
