
Java - UTF8/16 is a Charset Name or Character Encoding?

Original post: 2013-03-11 20:48:57 9 3 java/character-encoding

The application I am developing will be used by people in Western and Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.

My confusion arises because when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it is really a character encoding. Also, my default encoding is set in Eclipse as Cp1252.

Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?

3 Answers

They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.

Actually, in Unicode terminology they're probably most accurately called character encoding schemes:

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Where a character encoding form is:

Mapping from a character set definition to the actual code units used to represent the data.
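To make the "encoding form plus byte serialization" distinction concrete, here is a minimal sketch that prints the bytes Java produces for the same one-character string under several of the schemes listed above. The class name EncodingSchemes and the hex helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSchemes {
    // Hypothetical helper: render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "A"; // U+0041
        System.out.println("UTF-8    : " + hex(text.getBytes(StandardCharsets.UTF_8)));    // 41
        System.out.println("UTF-16   : " + hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println("UTF-16BE : " + hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println("UTF-16LE : " + hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

The three UTF-16 variants share one encoding form (16-bit code units) and differ only in byte order and in whether a byte order mark is written, which is what makes them separate schemes.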

Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
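As a rough illustration of that mapping, the sketch below converts a String to byte[] and back using the same charset in both directions. The class and method names are invented for the example, and the non-ASCII letters are written as Unicode escapes so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Text -> binary, using UTF-8 as the mapping.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Binary -> text, using the same mapping in reverse.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Gr\u00fc\u00dfe"; // "Gruesse" spelled with u-umlaut and sharp s
        byte[] data = encode(original);
        System.out.println(data.length);                    // 7: the two non-ASCII letters take two bytes each
        System.out.println(decode(data).equals(original));  // true
    }
}
```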

I think those two things are not directly related.

The Eclipse setting decides how your Eclipse editor will save the text files (typically source code) you create or edit. You can use other editors, and then the file may be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
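For the output files the application itself writes, one way to keep the result independent of any editor or platform default is to name the charset explicitly on both the writing and the reading side. The following is only a sketch under that assumption; the class name, the roundTrip helper, and the temp-file prefix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class ExplicitCharsetFile {
    // Write the lines to a temp file as UTF-8, read them back as UTF-8, and return them.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Western and Eastern European sample words, written as escapes.
        List<String> lines = Arrays.asList("Gr\u00fc\u00dfe", "\u0141\u00f3d\u017a");
        System.out.println(roundTrip(lines).equals(lines)); // true
    }
}
```

Because neither call falls back to the default charset, a machine whose default is Cp1252 and one whose default is something else should produce and read the same bytes.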

The Java constructor String(byte[] bytes, String charsetName) belongs to your own application logic: it determines how you want to interpret the data you read from a file or the network. A different charsetName (essentially a different character encoding scheme) may interpret the same byte array differently.
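A small sketch of that last point: the same two bytes decoded under two different charset names. ISO-8859-1 is used here because every Java platform is required to support it; Cp1252 maps these two particular bytes to the same characters. The class name and the decode wrapper are made up for the example.

```java
import java.io.UnsupportedEncodingException;

public class SameBytesDifferentCharset {
    // Thin wrapper around String(byte[], String) that avoids the checked exception.
    static String decode(byte[] data, String charsetName) {
        try {
            return new String(data, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9 (e with acute accent)

        String asUtf8 = decode(data, "UTF-8");
        String asLatin1 = decode(data, "ISO-8859-1");

        System.out.println(asUtf8.length());                    // 1
        System.out.println(asUtf8.equals("\u00e9"));            // true
        System.out.println(asLatin1.length());                  // 2
        System.out.println(asLatin1.equals("\u00c3\u00a9"));    // true: two unrelated characters
    }
}
```

Decoding with a charset other than the one used for encoding does not fail here; it silently yields different text, which is why writer and reader need to agree on the name.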

A "charset" does imply the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the day, everybody was inventing their own character sets and encoding schemes, and the two mapped almost one-to-one, so one name could be used to refer to both the character set and the encoding scheme.
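The difference in repertoire can be probed from Java with CharsetEncoder.canEncode. The sketch below checks the euro sign against a few standard charsets; the class name and the canEncode wrapper are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRepertoire {
    // True if every character of the text can be represented in the given charset.
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // euro sign

        System.out.println(canEncode(StandardCharsets.UTF_8, euro));      // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, euro)); // false
        System.out.println(canEncode(StandardCharsets.US_ASCII, euro));   // false

        // String.getBytes substitutes a replacement byte for unmappable characters.
        byte[] lossy = euro.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length == 1 && lossy[0] == '?');         // true
    }
}
```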