Handling inputs with unsupported and/or multiple charsets in Java

Question

I am writing a Java SE 7 app to ingest all sorts of text-based inputs, and I am concerned about running into character sets/encodings that the JRE doesn't support (for instance, this app will run on a Linux box but will ingest files generated on every major OS).

For one, is there a way to catch an IOException (or similar) if the InputStreamReader encounters an unsupported charset/encoding?

And what about inputs that contain multiple encodings? Say we have 4 different types of inputs:

Raw java.lang.Strings
Plaintext (.txt) files
Word (.docx) files
PDF files

What if we're reading one of these inputs and we start encountering multiple (but supported) character encodings? Does the JRE natively handle this, or do I have to have multiple readers, each configured with its own charset/encoding?

In such a case, could I "normalize" the streaming inputs to a single, standardized (UTF-8 most likely) set/encoding? Thanks in advance.

Answer 1

To answer your first question: you can create a CharsetDecoder and specify what should happen when it encounters malformed or unmappable input.

// myCustomErrorAction is a java.nio.charset.CodingErrorAction: REPORT, REPLACE or IGNORE
CharsetDecoder charsetDecoder = Charset.forName("utf-8").newDecoder();
charsetDecoder.onMalformedInput(myCustomErrorAction);
charsetDecoder.onUnmappableCharacter(myCustomErrorAction);
Reader inputReader = new InputStreamReader(inputStream, charsetDecoder);
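As a minimal sketch of what this looks like end to end, assuming you want bad input reported rather than silently replaced: with CodingErrorAction.REPORT, the reader throws a CharacterCodingException (a subclass of IOException) when it hits bytes that are not valid in the chosen charset. The class name StrictDecodeExample and the helper decodeStrict are made up for illustration, and the input is an in-memory byte array just to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecodeExample {

    // Hypothetical helper: decodes the bytes in the named charset and
    // returns null if they are malformed or unmappable in that charset.
    public static String decodeStrict(byte[] bytes, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException and UnmappableCharacterException both extend this type
            return null;
        } catch (IOException e) {
            // Not expected for an in-memory stream
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC3, (byte) 0xA9};   // U+00E9 encoded as UTF-8
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // lead byte followed by a non-continuation byte
        System.out.println(decodeStrict(valid, "UTF-8") != null);
        System.out.println(decodeStrict(invalid, "UTF-8") != null);
    }
}
```

Returning null is only one way to surface the failure; in a real ingest pipeline you would probably log the offending file and rethrow or skip it.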

As for catching the case where an entire charset is not supported, it would look something like:

if (Charset.isSupported(encodingSpecified)) {
    //Normal case
} else {
    //Error case
}
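Two details are worth knowing here. First, Charset.isSupported itself throws IllegalCharsetNameException when the name is syntactically illegal, so the check needs its own guard. Second, the InputStreamReader constructor that takes a charset name as a String throws UnsupportedEncodingException, which is a subclass of IOException, so that route does give you something to catch. A rough sketch of a lookup with a fallback follows; the class CharsetLookupExample and the method charsetOrFallback are hypothetical names, and falling back to a default is an assumption about what the "error case" should do.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class CharsetLookupExample {

    // Hypothetical helper: returns the named charset, or the fallback when the
    // name is null, syntactically illegal, or not supported by this JVM.
    public static Charset charsetOrFallback(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            // Thrown by isSupported for names containing illegal characters
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrFallback("ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOrFallback("x-no-such-charset", StandardCharsets.UTF_8));
        System.out.println(charsetOrFallback("not a charset!", StandardCharsets.UTF_8));
    }
}
```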

I'm not sure about multiple encodings, however. I would think it is extremely unusual for a single binary stream to contain multiple encodings. The stream would need some custom way of indicating each encoding change. You would have to read from the stream one character at a time, looking for that indicator, and on finding it create a new reader on the same stream with the new encoding.
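One caveat with swapping readers on a live stream: InputStreamReader is documented as possibly reading ahead from the underlying stream, so bytes belonging to the next section may already have been consumed. A safer variant, assuming your format lets you find the section boundaries at the byte level first, is to split the bytes and decode each piece with its own charset. The sketch below assumes the splitting has already happened; the class MixedEncodingExample and the method decodeSegments are illustrative names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MixedEncodingExample {

    // Hypothetical helper: decodes each byte segment with its matching charset
    // and concatenates the results into a single String.
    public static String decodeSegments(byte[][] segments, Charset[] charsets) {
        if (segments.length != charsets.length) {
            throw new IllegalArgumentException("need exactly one charset per segment");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            sb.append(new String(segments[i], charsets[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[][] segments = {
            {(byte) 0xE9},              // U+00E9 in ISO-8859-1
            {(byte) 0xC3, (byte) 0xA9}  // U+00E9 in UTF-8
        };
        Charset[] charsets = {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8};
        // Both segments decode to the same character
        System.out.println(decodeSegments(segments, charsets).length());
    }
}
```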

In all cases, once you go from a stream of bytes to a stream of characters in Java, those characters are held in memory as Java chars (UTF-16 code units) regardless of the source encoding, so there is nothing to normalize unless you write the data back out somewhere. If you are going to save that data to a file later, I would highly recommend picking one encoding and sticking with it for all stored data.
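If you do write the data back out, normalizing comes down to reading with the source charset and writing with the target one. Here is a minimal sketch, assuming UTF-8 as the storage encoding and in-memory streams for brevity; NormalizeToUtf8Example and transcodeToUtf8 are made-up names. Note that an InputStreamReader built from a Charset replaces malformed input rather than reporting it, so use a CharsetDecoder with REPORT if you need strictness.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NormalizeToUtf8Example {

    // Hypothetical helper: decodes the input using sourceCharset and
    // re-encodes the resulting characters as UTF-8.
    public static byte[] transcodeToUtf8(byte[] input, Charset sourceCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), sourceCharset);
             Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Not expected for in-memory streams
            throw new IllegalStateException(e);
        }
        // The writer was closed (and therefore flushed) by try-with-resources
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // U+00E9 in ISO-8859-1
        byte[] utf8 = transcodeToUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // one Latin-1 byte becomes two UTF-8 bytes
    }
}
```

For real files you would swap the byte-array streams for FileInputStream and FileOutputStream; the reader/writer pairing stays the same.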

solution1
3 ACCEPTED 2013-02-26 14:12:05