How to fix a broken UTF-8 string in Ruby 2
I have an input string that claims to be UTF-8 but is not, and I need to fix it. The code is in Ruby 2, so iconv is no longer available, and encode and force_encoding are not working as intended:
[5] pry(main)> a='zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[6] pry(main)> a.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[8] pry(main)> a.encode!(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[10] pry(main)> a.force_encoding(Encoding::UTF_8)
=> "zg\\u0142oszeniem"
How can I fix it?
Here's a solution using a regex:
a.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
It should work for that particular string:
[1] pry(main)> before = 'zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[2] pry(main)> before.split('')
=> ["z", "g", "\\", "u", "0", "1", "4", "2", "o", "s", "z", "e", "n", "i", "e", "m"]
[3] pry(main)> after = before.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
=> "zgłoszeniem"
[4] pry(main)> after.split('')
=> ["z", "g", "ł", "o", "s", "z", "e", "n", "i", "e", "m"]
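The reason encode and force_encoding had no visible effect is that the input is already valid UTF-8: it simply contains the literal characters backslash, u, 0, 1, 4, 2 rather than any invalid bytes. The sketch below checks that and wraps the substitution in a helper. The name unescape_unicode is hypothetical, and restricting the escape to exactly four hex digits is an assumption (the JSON-style \uXXXX form) made so that a hex letter directly following the escape is not swallowed.

```ruby
# Hypothetical helper (the name is illustrative, not from any library).
# Assumes JSON-style escapes: a backslash, "u", then exactly four hex digits.
def unescape_unicode(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
end

broken = 'zg\u0142oszeniem'   # single quotes: the backslash stays a literal character
puts broken.valid_encoding?   # already valid UTF-8, so encode has nothing to replace
puts broken.length            # 16 characters, counting "\", "u", "0", "1", "4", "2"

fixed = unescape_unicode(broken)
puts fixed
puts fixed.length             # 11 characters once the escape is collapsed
```

This sketch does not handle escapes that encode surrogate halves (for example an emoji written as a \uD83D\uDE00 pair); those would need to be combined into one code point before calling chr.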
[1] Unicode code points can range from 0 to 10FFFF (hexadecimal); see definition D9 in Section 3.4, Characters and Encoding, of the Unicode Standard. That should explain why the regex above looks the way it does.
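The upper bound in the footnote can be checked directly in Ruby. This is a small sketch relying on Integer#chr raising RangeError for values outside the Unicode codespace:

```ruby
# The largest code point, 0x10FFFF, converts fine and takes four bytes in UTF-8.
max = 0x10FFFF.chr(Encoding::UTF_8)
puts max.bytesize

# One past the end of the codespace is rejected.
begin
  0x110000.chr(Encoding::UTF_8)
rescue RangeError => e
  puts "rejected: #{e.message}"
end
```

Note that in a regex alternation the leftmost branch that matches wins, so with the {1,5} branch listed first a six-digit value such as 10FFFF would be consumed only up to 10FFF. Listing the 10[0-9a-fA-F]{4} branch first avoids that.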