
Fuzzy compare between two Hive columns using Apache Spark with Scala

I am reading data from two Hive tables. The token table holds the tokens that need to be matched against the input data. The input data has a description column along with other columns. I need to split each input description into words and compare every word with all the elements of the token table. Currently I am using the me.xdrop.fuzzywuzzy.FuzzySearch library for fuzzy matching.

Below is my code snippet:

val tokens = sqlContext.sql("select token from tokens")
val desc = sqlContext.sql("select description from desceriptiontable")
// Read the column value with getString(0); Row.toString would add surrounding brackets
val desc_tokens = desc.rdd.flatMap(_.getString(0).split(" "))

Now I need to iterate over desc_tokens and fuzzy match each of its elements against every element of tokens. If the match exceeds 85%, I need to replace the element from desc_tokens with the matching element from tokens.

Example:

My token list is

hello
this
is
token
file
sample

and my input description is

helo this is input desc sampl

code should return

hello this is input desc sample 

Since hello and helo fuzzy match at more than 85%, helo is replaced by hello. The same applies to sampl and sample.
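The replacement rule described above can be sketched in plain Scala without any external library. This is only an illustration under stated assumptions: the `FuzzyReplace` object and its method names are hypothetical, and the score used here, 2 * LCS / (|a| + |b|) where LCS is the longest common subsequence length, is a stand-in that will not necessarily give the same numbers as FuzzySearch.

```scala
object FuzzyReplace {
  // Length of the longest common subsequence of a and b (row-by-row dynamic programming).
  def lcsLength(a: String, b: String): Int = {
    var prev = new Array[Int](b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      for (j <- 1 to b.length) {
        curr(j) =
          if (a(i - 1) == b(j - 1)) prev(j - 1) + 1
          else math.max(prev(j), curr(j - 1))
      }
      prev = curr
    }
    prev(b.length)
  }

  // Similarity in [0, 1]. Assumption: this ratio stands in for the library's score.
  def similarity(a: String, b: String): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else 2.0 * lcsLength(a, b) / (a.length + b.length)

  // Return the best-scoring token if it beats the threshold, otherwise the word itself.
  def bestMatch(word: String, tokens: Seq[String], threshold: Double): String =
    if (tokens.isEmpty) word
    else {
      val (best, score) = tokens.map(t => (t, similarity(word, t))).maxBy(_._2)
      if (score > threshold) best else word
    }

  // Split a description on spaces, correct each word, and join the words again.
  def replaceTokens(description: String, tokens: Seq[String], threshold: Double = 0.85): String =
    description.split(" ").map(bestMatch(_, tokens, threshold)).mkString(" ")
}
```

With the token list from the example, `FuzzyReplace.replaceTokens("helo this is input desc sampl", Seq("hello", "this", "is", "token", "file", "sample"))` scores helo against hello at 8/9 and sampl against sample at 10/11, both above 0.85, so both words are replaced while input and desc are kept.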

I ran a test with this library: https://github.com/rockymadden/stringmetric

Another idea (not optimized):

import com.rockymadden.stringmetric.similarity.JaroMetric

// The token order is changed on purpose
val tokens = Array("this", "is", "sample", "token", "file", "hello")
val desc_tokens = Array("helo", "this", "is", "token", "file", "sampl")

val res = desc_tokens.map { str =>
  // Compute the score between str and every token (compare returns an Option, empty input scores 0.0)
  val elem = tokens.zipWithIndex.map { case (tok, index) => (tok, index, JaroMetric.compare(str, tok).getOrElse(0.0)) }
  // Get the token with the highest score
  val emax = elem.maxBy { case (_, _, score) => score }
  // If the best score is above 0.85, use that token; otherwise keep the input word
  if (emax._3 > 0.85) tokens(emax._2) else str
}
res.foreach(println)

My output (printed one word per line): hello this is token file sample
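Since the approach above is described as not optimized, here is one possible way to cut down the number of comparisons, sketched in plain Scala. It assumes that exact matches can skip scoring entirely and that repeated words only need to be scored once. The `CachedCorrection` object and `correctAll` are hypothetical names, and the similarity function is passed in as a parameter so that any metric can be plugged in.

```scala
object CachedCorrection {
  // Correct every word, scoring each distinct non-exact word against the tokens only once.
  def correctAll(words: Seq[String], tokens: Seq[String], threshold: Double)
                (similarity: (String, String) => Double): Seq[String] = {
    // Words that already are tokens need no fuzzy comparison.
    val exact = tokens.toSet

    // Build a lookup table: distinct misspelled word -> chosen replacement.
    val replacements: Map[String, String] =
      words.distinct.filterNot(exact.contains).map { w =>
        val scored = tokens.map(t => (t, similarity(w, t)))
        val replacement =
          if (scored.isEmpty) w
          else {
            val (best, score) = scored.maxBy(_._2)
            if (score > threshold) best else w
          }
        w -> replacement
      }.toMap

    // Exact matches and unknown words fall through unchanged.
    words.map(w => replacements.getOrElse(w, w))
  }
}
```

With the library from the answer, the metric could be supplied as `(a, b) => JaroMetric.compare(a, b).getOrElse(0.0)`. The work is still proportional to distinct words times tokens, so this only helps when the descriptions repeat words or mostly contain exact tokens.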
