
Can a file be encoded in multiple charsets in Java?

Original post: 2012-05-14 13:22:19 · Tags: java, character-encoding, java-io

I'm working on a Java plugin that lets people write to and read from a file using a charset encoding of their choice. However, I'm confused about how I would combine multiple encodings in a single file. For example, suppose that the A characters come from one charset and the B characters come from another: would it be possible to write "AAAAABBBBBAAAAA" to a file?

If it is not possible, is that true of programming languages in general, or only of Java? And if it is possible, how would I then read (decode) the file?

I do not want to use the encode() and decode() methods of Charset, since my tests with them failed (some charsets were not decoded properly). I also don't want to use third-party programs, for various reasons, so the scope of this question is purely the standard Java packages/code.

Thanks a lot!
NS

3 answers

You'd need to read the file as a byte stream and either know beforehand at which byte positions each group of characters starts and ends, or use a special separator character or byte range that marks the start and end of each character group. That way you can extract the bytes of a specific character group and then decode them using the desired character encoding.
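As a rough illustration of the "known byte positions" variant, here is a minimal sketch using only the standard library. The class name `KnownOffsetsDemo`, the helper `decodeRange`, the sample strings, and the choice of ISO-8859-1 and UTF-16BE are all assumptions made for the example, and an in-memory buffer stands in for the file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class KnownOffsetsDemo {
    // Decode the bytes in [from, to) using the given charset.
    static String decodeRange(byte[] data, int from, int to, Charset charset) {
        return new String(data, from, to - from, charset);
    }

    // Write two groups in different encodings, then decode each by its byte range.
    static String roundTrip() {
        // "café " (with trailing space) is 5 bytes in ISO-8859-1.
        byte[] latin = "caf\u00e9 ".getBytes(StandardCharsets.ISO_8859_1);
        // Three CJK characters are 6 bytes in UTF-16BE (no byte order mark).
        byte[] utf16 = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_16BE);

        // Stand-in for the file: both groups written back to back.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(latin, 0, latin.length);
        file.write(utf16, 0, utf16.length);
        byte[] data = file.toByteArray();

        // The reader must already know where the first group ends.
        int boundary = latin.length;
        return decodeRange(data, 0, boundary, StandardCharsets.ISO_8859_1)
             + decodeRange(data, boundary, data.length, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

The separator variant works the same way once the boundaries are found, but it only holds up if the separator bytes can never occur inside any encoded group, which is hard to guarantee across arbitrary charsets.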

This problem is not specific to Java. The requirement is just strange; I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically all characters mankind is aware of.

Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?

A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
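To make the mapping concrete, the following small sketch encodes the same character with two charsets and then decodes the bytes with the wrong one. The class name `EncodingMappingDemo` and the helper `mojibake` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingMappingDemo {
    // Encode as UTF-8, then (incorrectly) decode the bytes as ISO-8859-1.
    static String mojibake(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // the single character é

        // Same character, different bytes depending on the encoding.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [-23]      (0xE9)
        System.out.println(Arrays.toString(utf8));   // [-61, -87] (0xC3 0xA9)

        // Decoding with the wrong charset yields two characters instead of one.
        System.out.println(mojibake(text).length()); // 2
    }
}
```

This is why a reader has to know which encoding produced each run of bytes: the bytes alone do not say.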

You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of which parts are encoded with which encoding, because the file doesn't automatically keep track of that for you.
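One possible tracking mechanism, sketched here under assumptions, is to make each part self-describing: store the charset name and the byte length in front of every segment. The layout (a segment count, then name, length, and bytes per segment) and the names `TaggedSegmentsDemo`, `write`, and `read` are invented for this example and are not a standard file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TaggedSegmentsDemo {
    // Layout: [int count] then, per segment, [charset name][int length][bytes].
    static byte[] write(String[] texts, Charset[] charsets) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(texts.length);
            for (int i = 0; i < texts.length; i++) {
                byte[] bytes = texts[i].getBytes(charsets[i]);
                out.writeUTF(charsets[i].name()); // which encoding this part uses
                out.writeInt(bytes.length);       // where this part ends
                out.write(bytes);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read every segment back, decoding each with the charset stored in front of it.
    static String read(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int segments = in.readInt();
            for (int i = 0; i < segments; i++) {
                Charset charset = Charset.forName(in.readUTF());
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                text.append(new String(bytes, charset));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] data = write(
            new String[] {"caf\u00e9 ", "\u65e5\u672c\u8a9e", " done"},
            new Charset[] {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16BE,
                           StandardCharsets.US_ASCII});
        System.out.println(read(data));
    }
}
```

The same streams could wrap a `FileOutputStream` and `FileInputStream` instead of the in-memory buffers. Note that the result is a binary container rather than a text file: an ordinary text editor would not display it sensibly.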

I was wondering about this as well, because my client just asked a similar question. As BalusC mentioned, this is not a Java-specific problem. After some back and forth, I found that the real question might be about 'multiple encodings of information' rather than a file with multiple encodings. That is, we have an XML string whose text needs to be encoded as 8859-1; if we save it as a file, we need to encode it. The default encoding for XML is UTF-8, and it may not be necessary to encode the whole XML document as 8859-1, since the XML node is just a vehicle for passing information to another system, and it is the content (the value of the XML node) that needs to be persisted as 8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as 8859-1.
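A minimal sketch of that last step, assuming the receiver uses the JDK's built-in DOM parser: parse the UTF-8 XML, pull out the node value as a Java string, and only then encode that value as ISO-8859-1 for persistence. The class name `XmlReencodeDemo`, the method `valueAsLatin1`, and the element name `name` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlReencodeDemo {
    // Parse the XML bytes, read the first element with the given tag name,
    // and return its text encoded as ISO-8859-1.
    static byte[] valueAsLatin1(byte[] xmlBytes, String tagName) {
        try {
            // The parser picks up the encoding from the XML declaration.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            String value = doc.getElementsByTagName(tagName).item(0).getTextContent();
            // Characters that ISO-8859-1 cannot represent are replaced with '?'.
            return value.getBytes(StandardCharsets.ISO_8859_1);
        } catch (Exception e) {
            throw new IllegalStateException("Could not read XML value", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<msg><name>Jos\u00e9</name></msg>";
        byte[] stored = valueAsLatin1(xml.getBytes(StandardCharsets.UTF_8), "name");
        System.out.println(stored.length); // 4: one byte per character in ISO-8859-1
    }
}
```

The transport file stays in a single encoding throughout; the second encoding only appears at the point where the value is persisted.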