Ruby string: convert ASCII to Unicode

I have a string that contains ASCII special characters, and I want to convert them to the respective Unicode characters. For example, below is the string:

A “razor” is a rule of thumb that simplifies decision..  \nWe’re in a post-content age.  In the past,\nhealthier, wealthier life:  • Toxic relationships • Comparisons • Inactivity • Complaints • Instant gratification • Overthinking • Crazy “what if” fears 

Expected output:

A "razor" is a rule of thumb that simplifies decision..  \nWe're in a post-content age.  In the past,\nhealthier, wealthier life:  • Toxic relationships • Comparisons • Inactivity • Complaints • Instant gratification • Overthinking • Crazy "what if" fears

The best result I could get was with the unidecode gem, which converted the above string to this:

"A \"razor\" is a rule of thumb that simplifies decision..\nWe're in a post-content age.  In the past,\nhealthier, wealthier life:  * Toxic relationships * Comparisons * Inactivity * Complaints * Instant gratification * Overthinking * Crazy \"what if\" fears "

The problem with this approach is that unidecode's to_ascii method will also convert the characters if the string is in another language.

What you are asking about is not ASCII but ANSI, also known as Windows-1252. I would recommend taking a look at the Windows-1252 wiki page, as it has a table with the Unicode code point marked for each character. Essentially, there is no simple formula for converting from ANSI to Unicode: the table on that wiki page was built by finding the same glyph in Unicode and substituting it in.

One thing about ANSI, ASCII, and Unicode is that the first 128 characters are the same in all of them.
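To illustrate the point about the shared first 128 characters, here is a minimal sketch. It relies on Ruby's built-in Windows-1252 conversion table via String#encode; the helper names cp1252_byte_to_utf8 and ascii_range_shared? are made up for this example and are not part of any library.

```ruby
# Hypothetical helper: interpret one byte as Windows-1252 and convert it to UTF-8.
def cp1252_byte_to_utf8(byte)
  [byte].pack("C")                            # one raw byte as a binary string
        .force_encoding(Encoding::Windows_1252)
        .encode(Encoding::UTF_8)
end

# Hypothetical helper: true if bytes 0..127 map to the same Unicode code points.
def ascii_range_shared?
  (0..127).all? { |byte| cp1252_byte_to_utf8(byte).ord == byte }
end

puts ascii_range_shared?                              # prints "true"

# Above 127 the values differ: byte 0x95 is the bullet, which is U+2022 in Unicode.
puts format("U+%04X", cp1252_byte_to_utf8(0x95).ord)  # prints "U+2022"
```

A few byte values in the 0x80..0x9F range are undefined in Windows-1252, so a conversion like this can raise Encoding::UndefinedConversionError on arbitrary input.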

Personally, I would just make a lookup table. Ruby writes Unicode characters in strings as "\u<code point in hex>", where you replace <code point in hex> with the hexadecimal value of the code point. So the bullet point "•" would be written as "\u2022". If you need to look up Unicode code points, I recommend unicodeplus.com, as it even gives you the escape sequence for each code point in a few different programming languages.
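A lookup table along those lines could be sketched as follows. The table contents and the names PUNCTUATION_MAP, PUNCTUATION_PATTERN and normalize_punctuation are assumptions chosen to match the expected output in the question (curly quotes become straight quotes, bullets stay as they are); extend the table as needed.

```ruby
# Hypothetical lookup table: typographic character => plain replacement.
PUNCTUATION_MAP = {
  "\u201C" => '"',  # left double quotation mark
  "\u201D" => '"',  # right double quotation mark
  "\u2018" => "'",  # left single quotation mark
  "\u2019" => "'"   # right single quotation mark (apostrophe)
}.freeze

# One pattern that matches any key of the table.
PUNCTUATION_PATTERN = Regexp.union(PUNCTUATION_MAP.keys)

# Replace only the characters listed in the table; everything else,
# including bullets and text in other languages, is left untouched.
def normalize_punctuation(text)
  text.gsub(PUNCTUATION_PATTERN, PUNCTUATION_MAP)
end

puts normalize_punctuation("A \u201Crazor\u201D \u2022 We\u2019re in a post-content age.")
# prints: A "razor" • We're in a post-content age.
```

Because only the listed characters are replaced, this avoids the unidecode problem of transliterating strings written in another language.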
