
Detecting similar strings in Ruby

In my database there are entries such as Тормозной диск, Диски тормозные LPR, etc., stored in the art_groups_arr array. I would like to find all the entries similar to Тормозной диск, such as Диски тормозные LPR.

This code:

art_groups_arr.each do |artgrarr|
  if n2.art_group.include?(artgrarr)
    non_original << n2
  end
end

does not find them, obviously, because include? only matches exact substrings. How can I find those similar strings?

You can perhaps use a regex that matches the word stems, for example:

art_groups_arr.each do |art_gr_arr|
  # /i makes the match case-insensitive, so "Диски" matches as well as "диск"
  if n2.art_group.any? { |element| element =~ /ормозн/ && element =~ /диск/i }
    non_original << n2
  end
end
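As a self-contained illustration of the stem-matching idea, here is a minimal sketch that works on plain strings. The sample entries, the STEMS constant and the brake_disc? helper are hypothetical names made up for this example, not part of the original code. Note that a case-sensitive /диск/ would not match Диски, because of the capital Д, so the sketch uses the /i flag.

```ruby
# Hypothetical sample data standing in for the entries in the database.
ART_GROUPS = [
  'Тормозной диск',
  'Диски тормозные LPR',
  'Колодки тормозные',
  'Фильтр масляный'
].freeze

# Stems that must all occur in an entry. The /i flag makes each match
# case-insensitive, so both "Диски" and "диск" are accepted.
STEMS = [/тормозн/i, /диск/i].freeze

# Returns true when every stem matches somewhere in the string.
def brake_disc?(name)
  STEMS.all? { |stem| stem.match?(name) }
end

# Keep only the entries that contain all stems.
def similar_entries(names)
  names.select { |name| brake_disc?(name) }
end

puts similar_entries(ART_GROUPS)
```

This only works when you know the stems in advance; for open-ended similarity a scoring approach is usually a better fit.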

Alternatively, you can try out the fuzz_ball gem, which claims to implement the Smith-Waterman algorithm.

require 'fuzz_ball'
THRESHOLD_SCORE = 0.75
MATCHER = FuzzBall::Searcher.new [ 'Тормозной диск LPR' ]

def complies?( str )
  matchdata = MATCHER.search str
  return false if matchdata.nil? or matchdata.empty?
  score = matchdata[0][:score]
  puts "score is #{score}"
  score > THRESHOLD_SCORE
end

art_groups_arr.each do |art_gr_arr|
  if n2.art_group.any? { |element| complies? element } then
    non_original << n2
  end
end

For 'Диски тормозные LPR' you get a score of 0.861, so you will have to tune the threshold to suit your data.
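If you would rather not depend on a gem, a rough similarity score can also be computed in plain Ruby. The sketch below is not Smith-Waterman and will not reproduce fuzz_ball's scores. It swaps in a simpler technique, the Sørensen-Dice coefficient over character trigrams, which tolerates reordered words and different endings. The helper names and the 0.4 threshold are assumptions chosen for illustration, and the threshold needs tuning just like the one above.

```ruby
require 'set'

# Hypothetical cut-off for this sketch; tune it against your own data.
TRIGRAM_THRESHOLD = 0.4

# Splits a string into a set of lowercase character trigrams.
# String#downcase handles Cyrillic in Ruby 2.4 and later.
def trigrams(str)
  chars = str.downcase.chars
  # Strings shorter than three characters become a single-element set.
  return Set.new([chars.join]) if chars.size < 3

  chars.each_cons(3).map(&:join).to_set
end

# Sørensen-Dice coefficient over the two trigram sets:
# 0.0 when nothing is shared, 1.0 when the sets are identical.
def trigram_similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  2.0 * (ta & tb).size / (ta.size + tb.size)
end

# True when the two strings score above the (assumed) threshold.
def similar?(a, b)
  trigram_similarity(a, b) > TRIGRAM_THRESHOLD
end

['Диски тормозные LPR', 'Фильтр масляный'].each do |candidate|
  score = trigram_similarity('Тормозной диск', candidate)
  puts "#{candidate}: #{score.round(3)}"
end
```

Because the score depends only on shared three-character fragments, related but different items such as brake pads can also score fairly high, so check the threshold against real entries before relying on it.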
