Ruby 1.8 regexp: index of match in UTF-8 string
I'm trying to search a text for a match and return it with a snippet of the surrounding text. For this, I want to find the match with a regex, then cut the string using the match index plus or minus the snippet radius (text.mb_chars[start..finish]).
However, I cannot get Ruby 1.8's regex to return a match index that is multi-byte aware.
I understand that regex is the one place in 1.8 that is supposed to be UTF-8 aware, but it doesn't seem to work despite the /u switch:
"Résumé" =~ /s/u
=> 3
"Resume" =~ /s/u
=> 2
The result should be the same in both cases if the regex were really working in multibyte mode (/u), but it is returning a byte index.
How do you get the match index in characters, not bytes?
Or is there some other way to get a snippet around (each) match?
Not a real answer, but too long for a comment.
The code
print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u
on Windows (Ruby 1.8.6, release 26.) prints:
2
2
And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:
3
2
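The difference between the two platforms would be consistent with the encoding of the string literal rather than with the regex engine, though that is an assumption and was not verified on those machines. If the script was saved as Latin-1 (or Windows-1252) on Windows, é occupies one byte and s sits at byte offset 2. Saved as UTF-8, é takes two bytes and s moves to byte offset 3. A small sketch that counts the bytes before the s under each encoding, using only pack and unpack so that it does not depend on the Ruby version:

```ruby
# Code points of the text that precedes "s" in "Résumé": "R" (82) and "é" (233).
prefix = [82, 233]

# Encoded as UTF-8, "é" becomes two bytes (0xC3 0xA9).
utf8_bytes = prefix.pack('U*').unpack('C*')

# Encoded as Latin-1, each of these code points is a single byte.
latin1_bytes = prefix.pack('C*').unpack('C*')

puts utf8_bytes.length    # 3, the index reported on Linux
puts latin1_bytes.length  # 2, the index reported on Windows
```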
How about using this jindex function I wrote, which corresponds to the other methods in the jcode library:
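require 'jcode'    # provides String#jlength (Ruby 1.8 only)
require 'English'  # provides $PREMATCH as an alias for $`

$KCODE = 'u'       # assumed: makes split(//) and jlength treat strings as UTF-8

class String
  # Character-based slice; returns "" when the slice is out of range.
  def jslice(*args)
    split(//)[*args].join rescue ""
  end

  # Character index of the first match at or after character `start`, or nil.
  def jindex(match, start = 0)
    if match.is_a? String
      match = Regexp.new(Regexp.escape(match))
    end
    if self.jslice(start..-1) =~ match
      $PREMATCH.jlength + start
    else
      nil
    end
  end
end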
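If jcode is not available, a similar result can be sketched with unpack('U*'), which decodes a UTF-8 string into an array of code points. The helper names char_index and snippets below are made up for this illustration and are not part of any library, and the sketch assumes the text is valid UTF-8. It counts the code points in the text before each match to get a character index, then slices the code point array to build the snippet:

```ruby
# Hypothetical helper (not from jcode): character index of the first match, or nil.
def char_index(text, pattern)
  m = pattern.match(text)
  return nil unless m
  # pre_match is the text before the match; count its code points, not its bytes.
  m.pre_match.unpack('U*').length
end

# Hypothetical helper: one snippet per match, extended by `radius` characters
# on each side and clamped to the bounds of the text.
def snippets(text, pattern, radius)
  chars = text.unpack('U*') # code points; assumes valid UTF-8 input
  found = []
  text.scan(pattern) do
    m = Regexp.last_match
    start  = m.pre_match.unpack('U*').length
    finish = start + m[0].unpack('U*').length - 1
    from = [start - radius, 0].max
    to   = [finish + radius, chars.length - 1].min
    found << chars[from..to].pack('U*')
  end
  found
end

puts char_index("Résumé", /s/u)   # 2
puts char_index("Resume", /s/u)   # 2
puts snippets("Résumé", /s/u, 1)  # ésu
```

Because the index is derived from the text before the match and not from the number returned by =~, it should not matter whether the running Ruby reports byte or character offsets.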