Ruby 1.8 regexp: index of match in utf string

Question

I'm trying to search a text for a match and return it with a snippet around it. To do this, I want to find the match with a regex, then cut the string using the match index ± the snippet radius (text.mb_chars[start..finish]).

However, I cannot get Ruby 1.8's regex to return a match index that is multi-byte aware.

I understand that regex is the one place in 1.8 that is supposed to be UTF-aware, but it doesn't seem to work despite the /u switch:

"Résumé" =~ /s/u
=> 3

"Resume" =~ /s/u
=> 2

If the regex were really working in multibyte mode (/u), both results should be the same, but it returns the byte index.

How do you get the match index in characters, not bytes?

Or is there some other way to get a snippet around (each) match?

Answer 1

Not a real answer, but too long for a comment.

The code

print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u

on Windows (Ruby 1.8.6, release 26.) prints:

2
2

And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:

3
2
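
One possible explanation, which is an assumption on my part and not something verified on either machine: the Windows source file may have been saved in a single-byte encoding such as ISO-8859-1, where "é" is one byte, while the Linux file was UTF-8, where "é" is two bytes. The sketch below writes the bytes with escapes so it does not depend on the file's own encoding, and finds the byte position of "s" in each case.

```ruby
# The same word as raw bytes in two encodings.
utf8_bytes   = "R\xC3\xA9sum\xC3\xA9".unpack('C*')  # "é" is two bytes in UTF-8
latin1_bytes = "R\xE9sum\xE9".unpack('C*')          # "é" is one byte in ISO-8859-1

s = 's'.unpack('C').first  # byte value of "s"

# Position of the first "s" byte in each byte array.
utf8_index   = utf8_bytes.index(s)
latin1_index = latin1_bytes.index(s)

puts utf8_index    # 3
puts latin1_index  # 2
```

If that is what happened, both runs returned a byte index, and the outputs differ only because the input bytes differ.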

Answer 2

How about using this jindex function I wrote, which corresponds to the other methods in the jcode library:

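require 'jcode'    # provides String#jlength
require 'English'  # provides $PREMATCH as an alias for $`
$KCODE = 'u'       # make split(//) and jlength treat strings as UTF-8

class String
  # Character-aware slice: split into characters, then index like an Array.
  def jslice *args
    split(//)[*args].join rescue ""
  end

  # Character index of the first match at or after character offset start, or nil.
  def jindex match, start=0
    if match.is_a? String
      match = Regexp.new(Regexp.escape(match))
    end
    if self.jslice(start..-1) =~ match
      $PREMATCH.jlength + start
    else
      nil
    end
  end
end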
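
As a rough alternative that avoids jcode, the character index can also be obtained by counting the code points in the text before the match. This is only a sketch and assumes the text is valid UTF-8. The helper names char_index and snippet are made up for illustration and are not part of any library. The strings use byte escapes so the example does not depend on the source file's encoding.

```ruby
# Hypothetical helper: character index of the first match, or nil if none.
def char_index(text, pattern)
  m = pattern.match(text)
  return nil unless m
  # unpack('U*') decodes UTF-8 into code points, so the array
  # length is a character count rather than a byte count.
  m.pre_match.unpack('U*').length
end

# Hypothetical helper: the first match plus up to `radius`
# characters on each side, or nil if there is no match.
def snippet(text, pattern, radius)
  m = pattern.match(text)
  return nil unless m
  chars  = text.unpack('U*')                 # whole text as code points
  idx    = m.pre_match.unpack('U*').length   # match start, in characters
  len    = m[0].unpack('U*').length          # match length, in characters
  start  = [idx - radius, 0].max
  finish = [idx + len + radius, chars.length].min
  chars[start...finish].pack('U*')           # back to a UTF-8 string
end

resume_utf8 = "R\xC3\xA9sum\xC3\xA9"  # "Résumé" as UTF-8 bytes

puts char_index(resume_utf8, /s/u)  # 2
puts char_index("Resume", /s/u)     # 2
```

With a radius of 1, snippet(resume_utf8, /s/u, 1) returns the three characters "ésu". Only the first match is handled here; getting a snippet around each match would need a loop over successive matches.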

solution1
0 2010-04-21 09:53:56

solution2
0 2011-02-08 17:05:41
