
How to fix a broken UTF-8 string in Ruby 2

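I have an input string that is supposed to be UTF-8 but isn't, and I need to fix it. The code runs on Ruby 2, so iconv is no longer available, and neither encode nor force_encoding works as expected: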

[5] pry(main)> a='zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[6] pry(main)> a.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[8] pry(main)> a.encode!(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => "?")
=> "zg\\u0142oszeniem"
[10] pry(main)> a.force_encoding(Encoding::UTF_8)
=> "zg\\u0142oszeniem"

How can I fix this?
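A quick check suggests why the calls above change nothing. This is only a sketch under the assumption that the source file uses Ruby's default UTF-8 encoding; the variable names `a` and `b` are illustrative. The single-quoted literal contains a backslash, the letter u, and four digits as ordinary ASCII characters, so the string is already valid UTF-8 and there is nothing for encode or force_encoding to replace.

```ruby
# Single quotes: \u is NOT treated as an escape, so the text stays literal.
a = 'zg\u0142oszeniem'
# Double quotes: \u0142 is interpreted and becomes the character ł.
b = "zg\u0142oszeniem"

a_valid = a.valid_encoding?  # => true, the bytes are well-formed UTF-8
a_ascii = a.ascii_only?      # => true, only ASCII characters are present
a_len   = a.length           # => 16, includes "\", "u", "0", "1", "4", "2"
b_len   = b.length           # => 11, the escape collapsed into one character
```

The problem is therefore one of unescaping text, not of repairing an invalid byte sequence.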

Here is a solution using a regular expression [1]:

a.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }

It should work for that particular string:

[1] pry(main)> before = 'zg\u0142oszeniem'
=> "zg\\u0142oszeniem"
[2] pry(main)> before.split('')
=> ["z", "g", "\\", "u", "0", "1", "4", "2", "o", "s", "z", "e", "n", "i", "e", "m"]
[3] pry(main)> after = before.gsub(/\\u([0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
=> "zgłoszeniem"
[4] pry(main)> after.split('')
=> ["z", "g", "ł", "o", "s", "z", "e", "n", "i", "e", "m"]
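The substitution above can be wrapped in a small method for reuse. The sketch below is an illustration, not part of the original answer: the name `unescape_unicode` is hypothetical, and it deliberately swaps in a stricter pattern of exactly four hex digits (the form the input in the question uses), so that a hex-looking letter directly after the escape is not swallowed. It assumes every escape has four digits; code points above U+FFFF and surrogate escapes are out of scope.

```ruby
# Hypothetical helper: the name and the four-digit assumption are illustrative.
# Replaces literal "\uXXXX" text with the character that XXXX names.
def unescape_unicode(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) do
    # $1 holds the four hex digits; convert them to an Integer code point,
    # then to a one-character UTF-8 string.
    $1.hex.chr(Encoding::UTF_8)
  end
end

fixed = unescape_unicode('zg\u0142oszeniem')  # => "zgłoszeniem"
```

With the 1-to-5-digit pattern, an input such as 'caf\u00e9e' would be read as the single escape 00e9e; the four-digit pattern yields "cafée" instead. Which behaviour is right depends on how the escapes were produced.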

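[1] Unicode code points range from 0 to 10FFFF₁₆ (definition D9 in Section 3.4, "Characters and Encoding", of the Unicode Standard), which should explain why the regular expression above looks the way it does.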
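To make the footnote concrete, the sketch below checks both the two-branch pattern and Ruby's own limit. The names `CODE_POINT_HEX`, `code_point_hex?` and `encodable?` are hypothetical and exist only for illustration. The first branch, `[0-9a-fA-F]{1,5}`, covers 0 through FFFFF; the second, `10[0-9a-fA-F]{4}`, covers 100000 through 10FFFF.

```ruby
# Anchored copy of the two alternatives, so the whole string must match.
CODE_POINT_HEX = /\A(?:[0-9a-fA-F]{1,5}|10[0-9a-fA-F]{4})\z/

# True when the hex string names a value inside 0..10FFFF.
def code_point_hex?(hex)
  !(hex =~ CODE_POINT_HEX).nil?
end

# True when Ruby can turn the integer into a UTF-8 character.
def encodable?(code_point)
  code_point.chr(Encoding::UTF_8)
  true
rescue RangeError
  # Integer#chr raises RangeError for values outside the encoding's range.
  false
end

code_point_hex?("0142")    # => true  (first branch)
code_point_hex?("10FFFF")  # => true  (second branch, the maximum)
code_point_hex?("110000")  # => false (one past the maximum)
encodable?(0x10FFFF)       # => true
encodable?(0x110000)       # => false
```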

