Apache Spark: JOINing RDDs (data sets) using custom criteria/fuzzy matching

Is it possible to join two (Pair)RDDs (or Datasets/DataFrames) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates, and various "distance methods", e.g. Levenshtein, for strings?

For "grouping" within an RDD to get a PairRDD , one can implement a PairFunction , but it seems that something similar is not possible when JOINing two RDD s/data sets? I am thinking something like:

rdd1.join(rdd2, myCustomJoinFunction);

I was thinking about implementing the custom logic in hashCode() and equals() but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into RDD.cogroup() but have not figured out how I could use it to implement this.

I just came across elasticsearch-hadoop. Does anyone know if that library could be used to do something like this?

I am using Apache Spark 2.0.0. I am implementing in Java but an answer in Scala would also be very helpful.

P.S. This is my first Stack Overflow question, so bear with me if I have made some newbie mistakes :).

For DataFrames/Datasets you can use join with a custom join condition. Create a UDF that uses columns from the DataFrames, just like in the first answer to this question.
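
A minimal Java sketch of that idea (the column names "name" and "amount", the input paths, the UDF name and the tolerances are all made up for illustration; Spark also has a built-in levenshtein column function that can be used directly in the join condition):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.levenshtein;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class FuzzyJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fuzzy-join").master("local[*]").getOrCreate();

        // Hypothetical inputs, each with a "name" (string) and an "amount" (double) column.
        Dataset<Row> df1 = spark.read().json("orders1.json");
        Dataset<Row> df2 = spark.read().json("orders2.json");

        // UDF expressing a custom "close enough" criterion on a numeric column.
        spark.udf().register("withinTolerance",
                (UDF2<Double, Double, Boolean>) (a, b) -> Math.abs(a - b) <= 5.0,
                DataTypes.BooleanType);

        // Join on a combination of fuzzy criteria:
        //  - names within Levenshtein distance 2 (built-in column function)
        //  - amounts within an absolute tolerance (the UDF above)
        // There are no equi-join keys, so Spark falls back to a nested-loop style join
        // that effectively compares every pair of rows.
        Dataset<Row> joined = df1.join(df2,
                levenshtein(df1.col("name"), df2.col("name")).leq(2)
                        .and(callUDF("withinTolerance", df1.col("amount"), df2.col("amount"))));

        joined.show();
    }
}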

You can also do

rdd1.cartesian(rdd2).filter(...)

Keep in mind that this computes every pair of records, so it can take a long time.
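
A rough Java sketch of the cartesian-plus-filter approach on plain RDDs (the sample data and the numeric tolerance are invented; any matching rule, e.g. a Levenshtein distance on strings, can go into the filter):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class CartesianFuzzyJoin {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cartesian-fuzzy-join").setMaster("local[*]"));

        // Hypothetical sample data: (id, measurement) records from two sources.
        JavaRDD<Tuple2<String, Integer>> rdd1 =
                sc.parallelize(Arrays.asList(new Tuple2<>("a", 100), new Tuple2<>("b", 205)));
        JavaRDD<Tuple2<String, Integer>> rdd2 =
                sc.parallelize(Arrays.asList(new Tuple2<>("x", 102), new Tuple2<>("y", 400)));

        // Build every (left, right) pair, then keep only pairs whose measurements
        // differ by at most 5. The cartesian product is O(n*m), hence the cost warning above.
        JavaPairRDD<Tuple2<String, Integer>, Tuple2<String, Integer>> matches =
                rdd1.cartesian(rdd2)
                    .filter(pair -> Math.abs(pair._1()._2() - pair._2()._2()) <= 5);

        matches.collect().forEach(System.out::println);

        sc.stop();
    }
}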
