How to specify a Regexp for Unicode Cyrillic characters in Ruby 1.9

#coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding                # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)    # finds all the Cyrillic characters
str2.gsub!(/\w/u, '')          # removes only the Latin characters
puts str2

The question is: why does \w ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/ . Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_], so it does not match any non-ASCII character, Cyrillic included.
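To see that equivalence in practice, here is a minimal sketch that compares \w against the explicit ASCII class on a mixed string. The helper names `word_chars` and `ascii_class_chars` and the sample string are made up for illustration; they are not part of Ruby or of the original question.

```ruby
# coding: utf-8

# Hypothetical helper: join every character that \w matches.
def word_chars(text)
  text.scan(/\w/).join
end

# Hypothetical helper: the same, using the explicit ASCII class.
def ascii_class_chars(text)
  text.scan(/[a-zA-Z0-9_]/).join
end

sample = "asdfМикимаус_42"
puts word_chars(sample)          # only the ASCII letters, underscore and digits
puts ascii_class_chars(sample)   # same result as the line above
```

Both calls return the same string, which is why the gsub! in the question leaves the Cyrillic part untouched.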

You probably want to use [[:alnum:]] instead, which includes all Unicode alphabetic and numeric characters. See also [[:word:]] and [[:alpha:]].
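As a rough sketch of the difference between these classes, the snippet below strips characters from a mixed Latin/Cyrillic string with each one. The helper names (`strip_alnum`, `strip_word`, `strip_alpha`) and the sample string are assumptions chosen for this example, not something from the original post.

```ruby
# coding: utf-8

# Hypothetical helpers: remove everything a given POSIX bracket class matches.
def strip_alnum(text)
  text.gsub(/[[:alnum:]]/, '')   # Unicode letters and digits
end

def strip_word(text)
  text.gsub(/[[:word:]]/, '')    # letters, digits and connectors such as "_"
end

def strip_alpha(text)
  text.gsub(/[[:alpha:]]/, '')   # letters only, digits are kept
end

sample = "asdf Микимаус_1!"
p strip_alnum(sample)   # => " _!"
p strip_word(sample)    # => " !"
p strip_alpha(sample)   # => " _1!"
```

Unlike \w, all three classes treat the Cyrillic letters the same way as the Latin ones, so replacing \w with [[:word:]] in the question's gsub! call should empty the string.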

 