
Unicode characters in a Ruby script?

I would like to write a Ruby script which writes Japanese characters to the console. For example:

puts "こんにちは・今日は"

However, I get an exception when running it:

jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression

Is it possible to do this? I'm using Ruby 1.8.6.

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
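To check which encoding an editor actually used, one option is to look at the first bytes of the saved file. The following is only a sketch: `detect_bom` is a hypothetical helper name, it recognises just the three byte order marks relevant here, and it does not try to tell UTF-16LE apart from UTF-32LE (whose mark also starts with FF FE).

```ruby
# Hypothetical helper: classify a string of raw bytes by its leading BOM.
# unpack("C*") yields plain integers, so this does not depend on how the
# running Ruby version models string encodings.
def detect_bom(bytes)
  b = bytes.unpack("C*")
  if b[0, 3] == [0xEF, 0xBB, 0xBF]
    :utf8_bom
  elsif b[0, 2] == [0xFF, 0xFE]
    :utf16le
  elsif b[0, 2] == [0xFE, 0xFF]
    :utf16be
  else
    :none
  end
end

# U+FEFF followed by the letter "p", as UTF-16LE would store it.
sample = [0xFF, 0xFE, 0x70, 0x00].pack("C*")
puts detect_bom(sample)                 # prints: utf16le
puts format("\\%o \\%o", 0xFF, 0xFE)    # prints: \377 \376

# For a real file, read the first bytes in binary mode, for example:
#   detect_bom(File.open("jap.rb", "rb") { |f| f.read(4).to_s })
```

The second line of output shows why the interpreter's error message mentions \377 and \376: those are the two BOM bytes written in octal.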

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
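An escaped literal like this does not have to be typed by hand. As a rough sketch, the script below rebuilds it from the Unicode code points of the greeting; `escape_bytes` is a hypothetical helper name, and the code points are listed numerically so that the generator itself stays pure ASCII.

```ruby
# Code points of the greeting from the question, written as numbers so this
# script contains no non-ASCII bytes of its own.
codepoints = [0x3053, 0x3093, 0x306B, 0x3061, 0x306F,
              0x30FB, 0x4ECA, 0x65E5, 0x306F]

# Hypothetical helper: render every byte of a string as a \xNN escape.
def escape_bytes(str)
  str.unpack("C*").map { |byte| format("\\x%02x", byte) }.join
end

utf8_text = codepoints.pack("U*")   # the "U" directive packs code points as UTF-8
puts escape_bytes(utf8_text)
```

Each of the nine characters takes three bytes in UTF-8, so the script prints 27 escapes.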

However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is, again misleadingly, known as the ANSI code page.

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
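One way to sanity-check an escaped cp932 literal is to decode it back to Unicode. This sketch assumes Ruby 1.9 or later, where strings carry an encoding and `String#encode` is available (the asker's 1.8.6 has neither), and it uses `Windows-31J`, Ruby's name for code page 932.

```ruby
# The escaped cp932 bytes from the answer. Note that "\x81E" is the byte 0x81
# followed by the ASCII letter E (0x45): together they encode the middle dot.
escaped = "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"

# Label the raw bytes as code page 932, then transcode them to UTF-8.
decoded = escaped.dup.force_encoding("Windows-31J").encode("UTF-8")

# The greeting from the question, built from its code points.
expected = [0x3053, 0x3093, 0x306B, 0x3061, 0x306F,
            0x30FB, 0x4ECA, 0x65E5, 0x306F].pack("U*")

puts decoded == expected   # prints: true
puts decoded.unpack("U*").map { |cp| format("U+%04X", cp) }.join(" ")
```

The second line lists the decoded code points, which makes it easy to compare against a character chart without relying on the console to render Japanese.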

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
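The code page limitation can be illustrated without a Windows machine by transcoding the same text to both code pages. This is a sketch only, again assuming Ruby 1.9 or later; `for_console` is a hypothetical helper, not something Ruby does for you when printing.

```ruby
# Hypothetical helper: transcode UTF-8 text to a given code page, replacing
# anything the code page cannot represent with "?".
def for_console(text, code_page)
  text.encode(code_page, :undef => :replace, :invalid => :replace)
end

# The greeting from the question, built from its code points.
text = [0x3053, 0x3093, 0x306B, 0x3061, 0x306F,
        0x30FB, 0x4ECA, 0x65E5, 0x306F].pack("U*")

western  = for_console(text, "Windows-1252")
japanese = for_console(text, "Windows-31J")

puts western             # prints: ?????????
puts japanese.bytesize   # prints: 18
```

Code page 1252 has no slot for any of the nine characters, so every one of them is replaced, while code page 932 stores each as two bytes.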

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)

 