
How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, which compare one string to another and see which gets the highest similarity score.

I have one very long string, which is a document, and a substring. The substring came from the original document, but it has been converted several times, so odd artifacts might have been introduced, such as a space here or a dash there. The substring will match a section of the text in the original document 99% or more. I am not trying to work out which document the string came from; I am trying to find the index in the document where the string starts.

If the string were identical because no random error had been introduced, I would use document.index(substring), but this fails if there is even a one-character difference.

I thought I could account for the difference by removing all characters except a-z from both the document and the substring, comparing them, and then using an index built while compressing the document to translate the position in the compressed string back to the position in the real document. This worked well where the difference was whitespace and punctuation, but it failed as soon as a single letter was different.
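For reference, the strip-and-map-back idea described above can be sketched roughly as follows. The helper names compress and find_ignoring_punctuation are made up for this illustration, and the sketch has the same limitation the question mentions: it only tolerates differences in non-letter characters and letter case.

```ruby
# Keep only the letters of text (lower-cased) and remember the offset each
# kept character had in the original string.
def compress(text)
  chars = []
  positions = []
  text.each_char.with_index do |ch, i|
    next unless ch.match?(/[a-z]/i)

    chars << ch.downcase
    positions << i
  end
  [chars.join, positions]
end

# Returns the offset in document where fragment starts, or nil if the
# letters of fragment do not occur as one contiguous run.
def find_ignoring_punctuation(document, fragment)
  doc_letters, doc_positions = compress(document)
  frag_letters, = compress(fragment)
  return nil if frag_letters.empty?

  hit = doc_letters.index(frag_letters)
  hit && doc_positions[hit]
end

document = "The quick, brown fox -- jumps over the lazy dog."
fragment = "brown  fox-jumps"
start = find_ignoring_punctuation(document, fragment)
puts document[start, 9]
```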

The document is typically between a few pages and a hundred pages long, and the substring between a few sentences and a few pages.

You could try amatch. It's available as a Ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: http://flori.github.com/amatch/ .

I was bored and messing around with the idea, so here is a completely unoptimized and untested hack of a solution:

require 'amatch'

module FuzzyFinder
  module_function

  # Yields (or, without a block, returns) [start offset, word] for every
  # word in input.
  def scanner(input)
    out = [] unless block_given?
    input.scan(/\w+/) do |word|
      startpos = Regexp.last_match.begin(0)
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end

  # Returns [best score, [offsets in doc of the word starts with that score]].
  def find(text, doc)
    index = scanner(doc)
    sstr = text.gsub(/\W/, '')
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen * 2
    index.each_with_index do |(spos, word), i|
      str = word
      # Append following words until the candidate is as long as the fragment.
      while str.length < minlen
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0, minlen) if str.length > minlen
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

Obviously there are numerous improvements possible, and they are probably necessary! A few off the top of my head:

  1. Process the document once and store the results, possibly in a database.
  2. Determine a usable length of string for an initial check, and process against that initial substring first before trying to match the entire fragment.
  3. Following up on the previous point, precalculate starting fragments of that length.
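The three improvements above could be combined along these lines. This is only a rough sketch under my own assumptions: FragmentIndex, SEED_LENGTH and max_seed_errors are hypothetical names and values, and a plain-Ruby edit distance stands in for amatch so the example runs without any gems.

```ruby
# Plain dynamic-programming Levenshtein distance, included so the sketch
# runs without gems; Amatch::Levenshtein could be swapped in.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper class, not part of any gem.
class FragmentIndex
  SEED_LENGTH = 12 # assumed seed size; tune it for real data

  def initialize(document)
    letters = []
    @origins = []     # @origins[k]: offset in document of the k-th kept char
    @word_starts = [] # offsets into the stripped text where a word begins
    document.scan(/\w+/) do |word|
      offset = Regexp.last_match.begin(0)
      @word_starts << letters.length
      word.each_char.with_index do |ch, k|
        letters << ch
        @origins << offset + k
      end
    end
    @stripped = letters.join
    # Improvement 3: precalculate the starting fragment for every word.
    @seeds = @word_starts.map { |s| @stripped[s, SEED_LENGTH] }
  end

  # Returns [best distance, [document offsets]], or [nil, []] when no word
  # start passes the seed check.
  def find(fragment, max_seed_errors: 2)
    needle = fragment.gsub(/\W/, '')
    seed = needle[0, SEED_LENGTH]
    best = nil
    hits = []
    @word_starts.each_with_index do |start, n|
      # Improvement 2: cheap check on the short seed before the full compare.
      next if edit_distance(seed, @seeds[n]) > max_seed_errors

      score = edit_distance(needle, @stripped[start, needle.length])
      if best.nil? || score < best
        best = score
        hits = [@origins[start]]
      elsif score == best
        hits << @origins[start]
      end
    end
    [best, hits]
  end
end

doc = "Call me Ishmael. Some years ago, never mind how long precisely, " \
      "having little or no money in my purse."
index = FragmentIndex.new(doc) # improvement 1: build once, reuse for every query
score, offsets = index.find("never mind how-long precisly, having litle")
puts "distance #{score} at offsets #{offsets.inspect}"
```

The index object is what you would persist (for instance in a database) if the same document is queried repeatedly. Note that an error inside the first few characters of the fragment can push the seed over the threshold, so the threshold trades speed against missed matches.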

A simple one is fuzzy_match:

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

A more elaborate one (though you wouldn't guess it from this example) is levenshtein, which computes the number of differences:

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1
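Levenshtein.distance compares two whole strings, so on its own it does not tell you where a fragment sits inside a longer document. One possible way to bridge that gap is the approximate-substring variant of the same dynamic programme (usually attributed to Sellers), in which a match may start at any position in the text at no cost. The sketch below is plain Ruby and does not use the gem; fuzzy_index is a name invented for this example.

```ruby
# Approximate substring search. Returns [start, stop, distance] such that
# text[start...stop] is a substring needing the fewest edits to equal pattern.
def fuzzy_index(text, pattern)
  pat = pattern.chars
  m = pat.length
  dist = (0..m).to_a         # dist[i]: edits to match pat[0, i] ending here
  from = Array.new(m + 1, 0) # from[i]: where that alignment starts in text
  best = [0, 0, m]           # the empty substring costs m deletions

  text.each_char.with_index do |tc, j|
    new_dist = [0]           # an empty pattern prefix matches anywhere for free
    new_from = [j + 1]
    1.upto(m) do |i|
      candidates = [
        [dist[i - 1] + (pat[i - 1] == tc ? 0 : 1), from[i - 1]], # match or substitute
        [dist[i] + 1, from[i]],                                  # extra char in text
        [new_dist[i - 1] + 1, new_from[i - 1]]                   # char missing from text
      ]
      d, f = candidates.min_by(&:first)
      new_dist << d
      new_from << f
    end
    dist = new_dist
    from = new_from
    best = [from[m], j + 1, dist[m]] if dist[m] < best[2]
  end
  best
end

document = "The quick brown fox jumps over the lazy dog"
start, stop, distance = fuzzy_index(document, "brwn fox jmps")
puts "#{distance} edits at #{start}: #{document[start...stop].inspect}"
```

The running time is proportional to the document length times the fragment length, so for a hundred-page document and a multi-page fragment a pure-Ruby version like this will be slow; a C-backed library, or narrowing the search to a few candidate regions first, is likely to be necessary in practice.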

You should look at the StrikeAMatch implementation detailed here: "A better similarity ranking algorithm for variable length strings".

Instead of relying on some kind of string distance (i.e. the number of changes between two strings), this one looks at character-pair patterns. The more character pairs the two strings have in common, the better the match. It has worked wonderfully for our application, where we search for mistyped or variable-length headings in a plain text file.

There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
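To make the character-pair idea concrete, here is a minimal plain-Ruby sketch of Dice's coefficient over letter pairs. It is written from the description above rather than copied from the linked article or the gem, and the method names letter_pairs and pair_similarity are my own.

```ruby
# Adjacent letter pairs of every word, upper-cased,
# e.g. "France" gives FR, RA, AN, NC, CE.
def letter_pairs(str)
  str.upcase.split(/\s+/).flat_map do |word|
    word.chars.each_cons(2).map(&:join)
  end
end

# Dice's coefficient over the two multisets of pairs: twice the number of
# shared pairs divided by the total number of pairs, from 0.0 to 1.0.
def pair_similarity(a, b)
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?

  shared = 0
  remaining = pairs_b.dup
  pairs_a.each do |pair|
    idx = remaining.index(pair)
    next unless idx

    remaining.delete_at(idx) # each pair in b may be matched only once
    shared += 1
  end
  2.0 * shared / total
end

headings = ["Introduction", "Related Work", "Conclusions"]
closest = headings.max_by { |h| pair_similarity(h, "Relatd  work") }
puts closest
```

Because the score depends on which pairs occur rather than where, it ranks candidate strings well but does not by itself give a start offset; for the original question it would have to be applied to candidate windows of the document.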

It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z], you can parse the substring and then use Regexp#match on the document:

document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&#   +illam"

re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam

(Here, as we do not set any parentheses in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.)

If you are only interested in the start position, you can use the =~ operator:

start_pos = document =~ re
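One caveat with joining the words with ".*" is that it is greedy, lets arbitrary text (letters included) sit between the words, and does not cross newlines by default. A stricter variant, sketched here as an assumption about what is wanted rather than as part of the answer above, allows only runs of non-letters between the words and escapes each word; fragment_regexp is a hypothetical helper name.

```ruby
# Build a pattern that tolerates any run of non-letter characters
# (spaces, dashes, newlines, ...) between the words of the fragment.
def fragment_regexp(fragment)
  words = fragment.scan(/[a-zA-Z]+/)
  Regexp.new(words.map { |w| Regexp.escape(w) }.join('[^a-zA-Z]*'))
end

document = "Ulputat non nullandigna tortor\ndolessi illam sectem laor acipsus."
substr = "tortor - dolessi _%&#   +illam"

re = fragment_regexp(substr)
start_pos = document =~ re
md = document.match(re)
puts start_pos
puts md[0].inspect
```

Like the original, this still fails as soon as a letter differs, and a fragment with no letters at all produces an empty pattern that matches at position 0.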

I have used none of them, but I found some libraries just by searching for 'diff' on rubygems.org. All of them can be installed with gem. You might want to try them. I am interested myself, so if you already know these or if you try them out, it would be helpful if you left a comment.
