Is it possible to join two `(Pair)RDD`s (or `Dataset`s/`DataFrame`s) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates, or various distance measures, e.g. Levenshtein, for strings?
For "grouping" within an `RDD` to get a `PairRDD`, one can implement a `PairFunction`, but something similar does not seem to be possible when joining two `RDD`s/datasets. I am thinking of something like:
rdd1.join(rdd2, myCustomJoinFunction);
I was thinking about implementing the custom logic in `hashCode()` and `equals()`, but I am not sure how to make "similar" data end up in the same bucket. I have also been looking into `RDD.cogroup()` but have not figured out how I could use it to implement this.
I just came across elasticsearch-hadoop. Does anyone know if that library could be used to do something like this?
I am using Apache Spark 2.0.0. I am implementing in Java, but an answer in Scala would also be very helpful.
PS. This is my first Stackoverflow question so bear with me if I have made some newbie mistake :).
For DataFrames/Datasets you can use `join` with a custom join condition. Create a UDF that uses columns from the DataFrames, as in the first answer to this question.
You can also do

rdd1.cartesian(rdd2).filter(...)

Keep in mind that the Cartesian product is very expensive to compute.
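The cartesian+filter approach can be sketched without Spark: build every left/right pair, then keep only pairs whose keys are "close enough" under your chosen distance. A minimal plain-Java illustration (the `levenshtein` helper and the threshold of 2 are my own choices, not part of any Spark API) looks like this; with Spark you would replace the stream pipeline with `rdd1.cartesian(rdd2).filter(...)`:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FuzzyJoinSketch {
    // Classic dynamic-programming Levenshtein (edit) distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        List<String> left = List.of("apple", "banana");
        List<String> right = List.of("appel", "orange");
        // Cartesian product of both sides, then filter by fuzzy criterion --
        // the same shape as rdd1.cartesian(rdd2).filter(...) in Spark.
        List<String> matches = left.stream()
                .flatMap(l -> right.stream().map(r -> new String[]{l, r}))
                .filter(p -> levenshtein(p[0], p[1]) <= 2) // hypothetical threshold
                .map(p -> p[0] + "~" + p[1])
                .collect(Collectors.toList());
        System.out.println(matches); // [apple~appel]
    }
}
```

Note that, as with Spark's `cartesian`, this materializes all |left| x |right| pairs before filtering, which is why the cost warning above matters for large datasets.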