
Ruby: Smiley to utf-8 encoding

How can I convert this

string = "ok test body 😁😁😁\r\n-- \r\n test"

Into this

"ok test body \\ud83d\\ude01\\ud83d\\ude01\\ud83d\\ude01\r\n-- \r\n test"

I have tried this

string.encode('utf-16be','utf-8')

which converts it into this form

#=> "ok test body \u{1F601}\u{1F601}\u{1F601}\r\n-- \r\n test"

I think I need a regular expression to solve this. Can anyone tell me how to do that? Thanks.

Building on a previous answer, this code converts every non-ASCII character (for example U+1F601) into its UTF-16 code units, each written as a literal \uXXXX escape:

encoded_string = string.gsub(/[^[:ascii:]]/) do |non_ascii|
  non_ascii.force_encoding('utf-8')
           .encode('utf-16be')
           .unpack('H*').first
           .gsub(/(....)/,'\u\1')
end

For:

string = "ok test body 😁😁😁\r\n-- \r\n test"

it outputs:

"ok test body \\ud83d\\ude01\\ud83d\\ude01\\ud83d\\ude01\r\n-- \r\n test"

Quite similar to Eric Duminil's answer:

string.gsub(/[\u{10000}-\u{10FFFF}]/) { |m|
  '\u%s\u%s' % m.encode('UTF-16BE').unpack('H4H4')
}
#=> "ok test body \\ud83d\\ude01\\ud83d\\ude01\\ud83d\\ude01\r\n-- \r\n test"

The regular expression matches code points U+10000 to U+10FFFF, i.e. characters from the Supplementary Planes. In UTF-16, each of these is represented by two 16-bit code units, a so-called surrogate pair.
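To illustrate what the range restriction changes, here is a small sketch comparing the two patterns on a string that contains both a BMP character (é) and a supplementary character. The method names escape_supplementary and escape_non_ascii are hypothetical, chosen only for this example; the bodies are the two snippets from the answers.

```ruby
# Hypothetical helper names, used only for this comparison.

# Escapes only supplementary-plane characters (U+10000..U+10FFFF).
def escape_supplementary(str)
  str.gsub(/[\u{10000}-\u{10FFFF}]/) do |m|
    '\u%s\u%s' % m.encode('UTF-16BE').unpack('H4H4')
  end
end

# Escapes every non-ASCII character, including BMP ones such as "é".
def escape_non_ascii(str)
  str.gsub(/[^[:ascii:]]/) do |m|
    m.encode('UTF-16BE').unpack('H*').first.gsub(/(....)/, '\u\1')
  end
end

sample = "café 😁"
puts escape_supplementary(sample) # prints: café \ud83d\ude01
puts escape_non_ascii(sample)     # prints: caf\u00e9 \ud83d\ude01
```

Which of the two is appropriate depends on whether accented letters and other BMP characters should stay readable or be escaped as well.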

Each matched character is split via unpack into its high and low surrogate (the pattern H4 extracts 4 hexadecimal characters, i.e. 2 bytes or 16 bits):

'😁'.encode('UTF-16BE').unpack('H4H4')
#=> ["d83d", "de01"]
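The same two values can be derived by hand from the code point, which shows where the surrogates come from. This is a minimal sketch of the standard UTF-16 arithmetic; surrogate_pair is a hypothetical helper name and is not part of the answer above.

```ruby
# Hypothetical helper: computes the UTF-16 surrogate pair for a
# supplementary code point (U+10000..U+10FFFF).
def surrogate_pair(code_point)
  offset = code_point - 0x10000      # 20-bit value
  high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
  low  = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
  [high, low]
end

pair = surrogate_pair('😁'.ord).map { |unit| format('%04x', unit) }
p pair
#=> ["d83d", "de01"]
```

In practice encode('UTF-16BE') does this work already, so the manual version is only useful for understanding or checking the result.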

The result is formatted via %:

'\u%s\u%s' % ["d83d", "de01"]
#=> "\\ud83d\\ude01"
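As a sanity check, the escaped text can be decoded back and compared with the input. The following is only a sketch under the assumption that the escapes always appear as a high surrogate immediately followed by a low surrogate; the variable names are made up for this example.

```ruby
string = "ok test body 😁😁😁\r\n-- \r\n test"

# Escape supplementary characters as literal \uXXXX\uXXXX pairs.
escaped = string.gsub(/[\u{10000}-\u{10FFFF}]/) do |m|
  '\u%s\u%s' % m.encode('UTF-16BE').unpack('H4H4')
end

# Reverse step: find a high surrogate escape followed by a low surrogate
# escape, pack both as big-endian 16-bit units and re-encode as UTF-8.
decoded = escaped.gsub(/\\u(d[89ab]\h{2})\\u(d[c-f]\h{2})/i) do
  [$1.hex, $2.hex].pack('n2').force_encoding('UTF-16BE').encode('UTF-8')
end

puts decoded == string
```

If the comparison prints true, no information was lost by the escaping step.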

