简体   繁体   中英

Can a file be encoded in multiple charsets in Java?

I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another, would it be possible to write "AAAAABBBBBAAAAA" to a file?

If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?

I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard java packages/code.

Thanks a lot!
NS

You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or to use some special separator character/byterange which indicates the start and end of the character group. This way you can get the bytes of the specific character group and finally decode it using the desired character encoding.

This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8 which supports practically all characters the mankind is aware of.

Ofcourse it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?

A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.

You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.

I was wondering about this as well, because my client just asked a similar question. Like BalusC mentioned this is not a java specific problem. After a few back and forth, I found the real question might be 'multiple encoding of information', instead multiple encoding file. ie we have a xml string text needs to be encoded with 8859-1, if we save it as a file, then we need encode it. The default encoding for xml is UTF-8, we might not necessary to encode the whole xml as 8859-1. Since the xml node is just a vehicle of passing information over to other system and the content (value of the xml node, which needs to be persisted with 8859-1). So do we need multiple encoding in this case? probably not. We can still encode the xml with UTF-8, then pass it over. once the client receives the xml, then they need read the information out of the UTF-8 encoded file, and persist value of the xml node as 8859-1.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM