
Is this the best way to unescape unicode escape sequences in Ruby?

I have some text that contains Unicode escape sequences like \u003c. This is what I came up with to unescape it:

string.gsub(/\\u(....)/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}

Is it correct? (i.e., it seems to work with my tests, but can someone more knowledgeable find a problem with it?)

Your regex, /\u(....)/, has some problems.

First of all, \u doesn't work the way you think it does: in 1.9 you'll get an error, and in 1.8 it will just match a single u rather than the \u pair that you're looking for. You should use /\\u/ to find the literal \u that you want.
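One quick way to check the escaping is to test the pattern against single-quoted strings, where Ruby leaves the backslash alone. This is only a sketch, and the two sample strings are made up for illustration:

```ruby
# Single-quoted strings keep the backslash, so `escaped` holds the six
# characters \ u 0 0 3 c (illustrative sample input, not from the question).
escaped = '\u003c'
plain   = 'u003c'

# Inside a regex literal, \\ matches one literal backslash.
p escaped =~ /\\u/   # => 0   (match at the start of the string)
p plain =~ /\\u/     # => nil (a bare u is not enough)
```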

Secondly, your (....) group is much too permissive: it will allow any four characters through, and that's not what you want. In 1.9, you want (\h{4}) (four hexadecimal digits), but in 1.8 you'd need ([\da-fA-F]{4}) because \h is new in 1.9.
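To see the difference between the two groups, here is a small sketch that runs both against a string containing one real escape and one impostor. The input string and the variable names are invented for the example:

```ruby
# Illustrative input: \u03bc is a real escape, \uzzzz is not.
s = '\u03bc and \uzzzz'

loose  = /\\u(....)/            # any four characters after \u
strict = /\\u([\da-fA-F]{4})/   # exactly four hex digits after \u

p s.scan(loose).flatten    # => ["03bc", "zzzz"]
p s.scan(strict).flatten   # => ["03bc"]
```

With the loose group, "zzzz" would be handed to pack("H*") as if it were hex; the strict group leaves that text untouched.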

So if you want your regex to work in both 1.8 and 1.9, you should use /\\u([\da-fA-F]{4})/. This gives you the following in 1.8 and 1.9:

>> s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
=> "Where is \\u03bc pancakes \\u03BD house? And u1123!"
>> s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
=> "Where is μ pancakes ν house? And u1123!"

Using pack and unpack to mangle the hex number into a Unicode character is probably good enough but there may be better ways.
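One possible alternative, offered as a sketch rather than the definitive answer, is to convert the hex digits to an integer with String#hex and pack that single code point with "U". The method name unescape_unicode is hypothetical, and the sketch assumes the same thing the original does: each escape is a standalone four-digit code point, so surrogate pairs are not combined.

```ruby
# unescape_unicode is a hypothetical helper name chosen for this example.
# It replaces each \uXXXX escape with the character for that code point.
def unescape_unicode(str)
  str.gsub(/\\u([\da-fA-F]{4})/) do
    # $1 holds the four hex digits; String#hex turns them into an Integer,
    # and pack("U") encodes that code point as UTF-8.
    [$1.hex].pack("U")
  end
end

puts unescape_unicode('Where is \u03bc pancakes \u03BD house? And u1123!')
```

This prints the same line as the pack/unpack version above: "Where is μ pancakes ν house? And u1123!".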


 