Convert Unicode codepoint to string character in Ruby

I have these values from a Unicode database, but I'm not sure how to translate them into human-readable form. What are these even called?

Here they are:

  • U+2B71F
  • U+2A52D
  • U+2A68F
  • U+2A690
  • U+2B72F
  • U+2B4F7
  • U+2B72B

How can I convert these to their readable symbols?

How about:

# Using pack
puts ["2B71F".hex].pack("U")

# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)

In Ruby 1.9+ you can also do:

puts "\u{2B71F}"

I.e., the \u{} escape sequence can be used in a string literal to produce the character for a given Unicode codepoint.
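
Note that \u{} is resolved when the literal is parsed, so it cannot be built from a value held in a variable; for that case Integer#chr or Array#pack is needed. A minimal sketch that converts the whole list from the question (the variable names are illustrative):

```ruby
# Codepoints from the question, written as hex strings without the "U+" prefix.
hex_values = %w[2B71F 2A52D 2A68F 2A690 2B72F 2B4F7 2B72B]

# String#hex parses the hex digits into an Integer;
# Integer#chr builds the character in the requested encoding.
characters = hex_values.map { |h| h.hex.chr(Encoding::UTF_8) }

puts characters.join(" ")
```

Whether the characters display correctly depends on the terminal and on a font that covers these CJK extension blocks.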

Unicode symbols like U+2B71F are referred to as codepoints.

The Unicode system defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies etc. This character set is steadily growing.

For example, U+221E is the infinity sign (∞).

Codepoints are numbers, conventionally written in hexadecimal. Each encoded character is assigned exactly one codepoint.
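
As a small illustration of the character-to-number relationship, the sketch below goes the other way, from a character back to its codepoint, using String#ord and Kernel#format (the variable names are illustrative):

```ruby
char = "\u{221E}"                     # the infinity sign

codepoint = char.ord                  # the codepoint as an Integer
label = format("U+%04X", codepoint)   # conventional "U+" notation, at least 4 hex digits

puts codepoint   # the same number, printed in decimal
puts label
```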

There are many ways to arrange a codepoint in memory. Each such arrangement is known as an encoding, of which the common ones are UTF-8 and UTF-16. The conversion to and fro is well defined.

Here you are most probably looking to convert the Unicode codepoint to a UTF-8 character.
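
To make the difference between a codepoint and its encodings concrete, here is a small sketch that prints the bytes of one character under UTF-8 and UTF-16 (big-endian) using String#encode, then converts back. The variable names are illustrative:

```ruby
char = "\u{2B71F}"

utf8_bytes  = char.encode("UTF-8").bytes      # four bytes for this codepoint
utf16_bytes = char.encode("UTF-16BE").bytes   # a surrogate pair, two bytes each

puts utf8_bytes.map  { |b| format("%02X", b) }.join(" ")   # F0 AB 9C 9F
puts utf16_bytes.map { |b| format("%02X", b) }.join(" ")   # D8 6D DF 1F

# Converting to UTF-16 and back yields the original character.
round_trip = char.encode("UTF-16BE").encode("UTF-8")
puts round_trip == char
```

The codepoint stays the same in both cases; only the byte layout differs.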

codepoint = "U+2B71F"

You need to extract the hex part that comes after U+, leaving only 2B71F. This will be the first capture group.

codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/

And your UTF-8 character will be:

utf_8_character = [$1.hex].pack("U")
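
The steps above can be combined into one method. This is only a sketch: the name codepoint_to_char is hypothetical, and it uses String#match with \A and \z anchors instead of =~ and $1 so that it does not rely on global match variables and rejects any input with extra text around the codepoint.

```ruby
# Hypothetical helper: turns "U+XXXX" notation into a UTF-8 character.
# Returns nil when the input is not in the expected form.
def codepoint_to_char(notation)
  match = notation.to_s.match(/\AU\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})\z/)
  return nil unless match

  # match[1] holds the hex digits; pack("U") builds a UTF-8 string.
  [match[1].hex].pack("U")
end

puts codepoint_to_char("U+2B71F")
puts codepoint_to_char("U+221E")
```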

References:

  1. Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
  2. Tim Bray on the goodness of Unicode.
  3. Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
  4. Dissecting the Unicode regular expression.
