Is this the best way to unescape unicode escape sequences in Ruby?

Question

I have some text that contains Unicode escape sequences like \u003c. This is what I came up with to unescape it:

string.gsub(/\\u(....)/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}

Is it correct? (i.e., it seems to work with my tests, but can someone more knowledgeable find a problem with it?)

Answer 1

Your regex, /\u(....)/, has some problems.

First of all, \u doesn't work the way you think it does: in 1.9 you'll get an error, and in 1.8 it will just match a single u rather than the \u pair that you're looking for. You should use /\\u/ to find the literal \u that you want.

Secondly, your (....) group is much too permissive: it will allow any four characters through, and that's not what you want. In 1.9, you want (\h{4}) (four hexadecimal digits), but in 1.8 you'd need ([\da-fA-F]{4}), since \h is new in 1.9.
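To make the difference concrete, here is a small sketch comparing the two groups. The sample string is made up for illustration; it contains one real escape and one look-alike that the loose pattern also picks up.

```ruby
# Hypothetical sample input: one real escape (\u00e9) and one look-alike (\u zzz).
# Single quotes keep the backslashes literal.
s = 'caf\u00e9 and \u zzz'

loose  = /\\u(....)/           # literal \u followed by ANY four characters
strict = /\\u([\da-fA-F]{4})/  # literal \u followed by exactly four hex digits

loose_hits  = s.scan(loose).flatten
strict_hits = s.scan(strict).flatten

p loose_hits   # the loose group also captures " zzz"
p strict_hits  # only "00e9" is captured
```

With the loose group, the " zzz" capture would then be fed to pack("H*") as if it were hex, which is presumably not what you want.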

So if you want your regex to work in both 1.8 and 1.9, you should use /\\u([\da-fA-F]{4})/. This gives you the following in 1.8 and 1.9:

>> s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
=> "Where is \\u03bc pancakes \\u03BD house? And u1123!"
>> s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
=> "Where is μ pancakes ν house? And u1123!"

Using pack and unpack to mangle the hex number into a Unicode character is probably good enough but there may be better ways.
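One possibly simpler alternative, offered as a sketch rather than a definitive answer: String#hex turns the four captured digits into an integer, and a single pack("U") turns that code point into a UTF-8 character. The method name unescape_unicode is made up for this example, and it assumes every escape is exactly four hex digits, as in the regex above.

```ruby
# unescape_unicode is a hypothetical helper name, not part of Ruby or the original answer.
# Assumes each escape is \u followed by exactly four hex digits.
def unescape_unicode(str)
  # $1.hex converts e.g. "03bc" to 956; pack("U") encodes that code point as UTF-8.
  str.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
puts unescape_unicode(s)
```

For four-digit escapes this should give the same result as the pack("H*").unpack("n*").pack("U*") chain, with one conversion step instead of three.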

solution1
17 ACCEPTED 2011-08-10 20:18:44
