
How to replace multibyte characters in Ruby using gsub?

I have a problem saving records in MongoDB using Mongoid when they contain multibyte characters. This is the string:

a="Chris \xA5\xEB\xAE\xDFe\xA5"

I first convert it to BINARY and then gsub it like this:

a.force_encoding("BINARY").gsub(0xA5.chr,"oo")

...which works fine:

=> "Chris oo\xEB\xAE\xDFeoo"

But it seems that I cannot use the chr method when I use a Regexp:

a.force_encoding("BINARY").gsub(/0x....?/.chr,"")
NoMethodError: undefined method `chr' for /0x....?/:Regexp

Has anybody had the same issue?

Thanks a lot...

You can do that with interpolation:

a.force_encoding("BINARY").gsub(/#{0xA5.chr}/,"") 

gives

"Chris \xEB\xAE\xDFe"

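If the goal is a pattern over a whole range of bytes rather than a single byte, one possible sketch is a binary regexp literal with a byte range. The helper name `strip_high_bytes` is hypothetical and not from the original answer; the sketch assumes "multibyte" here means any byte from 0x80 to 0xFF.

```ruby
# Hypothetical helper: remove (or replace) every byte outside 7-bit ASCII.
def strip_high_bytes(str, replacement = "")
  # String#b returns a copy tagged as ASCII-8BIT (binary), so the
  # caller's string is left untouched.
  # The /n flag makes the regexp binary as well, so \x80-\xFF is a
  # range of raw bytes rather than characters.
  str.b.gsub(/[\x80-\xFF]/n, replacement)
end

a = "Chris \xA5\xEB\xAE\xDFe\xA5"
puts strip_high_bytes(a).inspect        # "Chris e"
puts strip_high_bytes(a, "oo").inspect  # "Chris ooooooooeoo"
```

Note that the plain "e" between the high bytes survives, because it is an ordinary ASCII character.
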
EDIT: based on the comments, here is a version that converts the binary-encoded string to its ASCII representation and runs a regex on that string:

a.unpack('A*').to_s.gsub(/\\x[A-F0-9]{2}/,"")[2..-3] #=> "Chris e"

The [2..-3] at the end gets rid of the leading [" and the trailing "].

NOTE: if you just want to get rid of the special characters, you could also use the following on the binary-encoded string (it removes every non-word character, including the space):

a.gsub(/\W/,"") #=> "Chrise"

The actual string does not contain the literal characters \xA5: that is just how Ruby displays bytes that would otherwise be unprintable (just as it shows \n when a string contains a newline).
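
A quick way to check this is to look at the bytes themselves. This is only an illustrative sketch using the string from the question:

```ruby
a = "Chris \xA5\xEB\xAE\xDFe\xA5"

# Each "\xNN" escape in the literal is a single byte, not four characters.
puts a.bytesize              # 12
puts a.bytes.last.to_s(16)   # a5

# The string holds no backslash at all; the backslashes appear only
# in the escaped form that inspect (and irb) prints.
puts a.b.include?("\\")      # false
puts a.inspect
```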

If you want to replace anything that is not ASCII, you could do this:

a="Chris \xA5\xEB\xAE\xDFe\xA5"
a.force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => 'oo')

This starts by forcing the string to the binary encoding. You always want to start with a string whose bytes are valid for its encoding, and binary is always valid since it can contain arbitrary bytes. It then converts the string to ASCII. Normally this would raise an error, since there are bytes it doesn't know what to do with, but the extra options tell it to replace invalid or undefined sequences with the characters 'oo'.
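
The two steps can be sketched as a small helper to make the behaviour with and without the options visible. The method name `to_ascii_with_placeholder` is hypothetical, and `dup` is used here (unlike in the snippet above) so that the original string is not modified.

```ruby
# Hypothetical helper wrapping the force_encoding + encode steps.
def to_ascii_with_placeholder(str, placeholder = "oo")
  # Work on a copy: force_encoding changes the receiver in place.
  binary = str.dup.force_encoding("BINARY")
  # Every byte with no ASCII equivalent is swapped for the placeholder.
  binary.encode("ASCII", invalid: :replace, undef: :replace, replace: placeholder)
end

a = "Chris \xA5\xEB\xAE\xDFe\xA5"

# Without the replacement options the conversion raises.
begin
  a.dup.force_encoding("BINARY").encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "without options: #{e.class}"
end

puts to_ascii_with_placeholder(a)   # Chris ooooooooeoo
```

Each of the five high bytes becomes its own 'oo', so the four consecutive bytes turn into eight o's.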
