
Convert Latin-1 content of InputStream into UTF-8 String

I need to convert the content of an InputStream into a String. The difficulty here is the input encoding, namely Latin-1. I tried several approaches and code snippets with String, getBytes, char[], etc. in order to get the encoding straight, but nothing seemed to work.

Finally, I came up with the working solution below. However, this code seems a little verbose to me, even for Java. So the question here is:

Is there a simpler and more elegant approach to achieve what is done here?

private String convertStreamToStringLatin1(java.io.InputStream is)
        throws IOException {

    String text = "";

    // setup readers with Latin-1 (ISO 8859-1) encoding
    BufferedReader i = new BufferedReader(new InputStreamReader(is, "8859_1"));

    int numBytes;
    CharBuffer buf = CharBuffer.allocate(512);
    while ((numBytes = i.read(buf)) != -1) {
        text += String.copyValueOf(buf.array(), 0, numBytes);
        buf.clear();
    }

    return text;
}

Firstly, a few criticisms of the approach you've taken already. You shouldn't unnecessarily use an NIO CharBuffer when you merely want a char[512] . You don't need to clear the buffer each iteration, either.

int numBytes;
final char[] buf = new char[512];
while ((numBytes = i.read(buf)) != -1) {
    text += String.copyValueOf(buf, 0, numBytes);
}

You should also know that just constructing a String with those arguments will have the same effect, as the constructor too copies the data.

The contents of the subarray are copied; subsequent modification of the character array does not affect the newly created string.
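As a minimal sketch of that point (the class name `CharBufferReuseDemo`, the method `readAll`, and the deliberately tiny buffer are assumptions made for illustration, not part of the original code), the loop below reuses a single `char[]` and relies on the constructor copying each chunk. It also collects the chunks in a `StringBuilder` rather than concatenating strings in the loop:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CharBufferReuseDemo {
    // Reads everything from the reader, reusing one char[] for every chunk.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4]; // tiny on purpose, to force several iterations
        int numChars;
        while ((numChars = reader.read(buf)) != -1) {
            // new String(char[], int, int) copies the subarray, so overwriting
            // buf on the next iteration cannot change chunks already appended.
            text.append(new String(buf, 0, numChars));
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Grüße aus Köln", written with Unicode escapes to stay source-encoding neutral
        System.out.println(readAll(new StringReader("Gr\u00fc\u00dfe aus K\u00f6ln")));
    }
}
```

`text.append(buf, 0, numChars)` would do the same without the intermediate String; the constructor is used here only to mirror the sentence above.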


You can use a dynamic ByteArrayOutputStream which grows an internal buffer to accommodate all the data. You can then use the entire byte[] from toByteArray to decode into a String .

The advantage is that deferring decoding until the end avoids decoding fragments individually; while that may work for simple charsets like ASCII or ISO-8859-1, it will not work on multi-byte schemes like UTF-8 and UTF-16. This means it is easier to change the character encoding in the future, since the code requires no modification.

private static final String DEFAULT_ENCODING = "ISO-8859-1";

public static final String convert(final InputStream in) throws IOException {
  return convert(in, DEFAULT_ENCODING);
}

public static final String convert(final InputStream in, final String encoding) throws IOException {
  final ByteArrayOutputStream out = new ByteArrayOutputStream();
  final byte[] buf = new byte[2048];
  int rd;
  while ((rd = in.read(buf, 0, buf.length)) >= 0) {
    out.write(buf, 0, rd);
  }
  // decode all collected bytes in one go
  return new String(out.toByteArray(), encoding);
}
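To illustrate why decoding fragments individually is fragile, here is a small sketch. The class and method names (`FragmentDecodingDemo`, `decodePerChunk`, `decodeAtEnd`) are made up for this example, and the chunk size of 1 is chosen only to force a split in the middle of a multi-byte character. It compares decoding each chunk on its own with decoding once at the end:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FragmentDecodingDemo {
    // Decodes every chunk separately; a multi-byte character that straddles
    // a chunk boundary is decoded as two malformed fragments.
    static String decodePerChunk(byte[] data, int chunkSize, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, cs));
        }
        return sb.toString();
    }

    // Collects all chunks first and decodes the complete byte[] once.
    static String decodeAtEnd(byte[] data, int chunkSize, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(data, off, len);
        }
        return new String(out.toByteArray(), cs);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // "é": one byte in Latin-1, two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(decodePerChunk(utf8, 1, StandardCharsets.UTF_8)));
        System.out.println(original.equals(decodeAtEnd(utf8, 1, StandardCharsets.UTF_8)));
        System.out.println(original.equals(decodePerChunk(latin1, 1, StandardCharsets.ISO_8859_1)));
    }
}
```

Per-chunk decoding happens to be safe for ISO-8859-1 because every byte maps to exactly one character, which is why the asker's original approach works for this particular encoding.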

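I don't see how it could be much simpler. I did this a little differently once. If you already have a String that was decoded with the wrong charset, you can try re-decoding it:

new String(originalString.getBytes(), "ISO-8859-1");

Be aware that this only works when the platform default charset, which getBytes() uses, maps every byte back unchanged. With a UTF-8 default, invalid byte sequences have already been replaced during the first decode and cannot be recovered. It is safer to tell the reader the charset up front, so something like this could also work:

BufferedReader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));
StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
  // readLine() strips the line terminator, so add one back
  sb.append(line).append('\n');
}
reader.close(); // also closes the underlying stream
return sb.toString();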

EDIT: I should add, this is really just an alternative to your already working solution. When it comes to converting Streams in Java it won't be much simpler, so go for it. :)

If you don't want to plumb it yourself, you could have a look at the Apache Commons IO project: IOUtils.toString(InputStream input, String encoding) seems to do what you want. I haven't tried that method myself, but the Javadoc states: "Get the contents of an InputStream as a String using the specified character encoding."

Guava's IO package is really nice for this.

Files.toString(yourFile, Charsets.ISO_8859_1)

or from a stream

new String(ByteStreams.toByteArray(stream), Charsets.ISO_8859_1)

I just found out that this answer to the question Read/convert an InputStream to a String can be applied to my problem, please see the code below. Anyway, I very much appreciate the answers you've given so far.

private String convertStreamToString(InputStream is, String charsetName) {
    try {
        return new java.util.Scanner(is, charsetName).useDelimiter("\\A").next();
    } catch (java.util.NoSuchElementException e) {
        return "";
    }
}

So in order to decode Latin-1 input, call it like this:

String message = convertStreamToString(is, "8859_1");
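A self-contained way to try this out is sketched below. The class name `ScannerLatin1Demo` and the sample bytes are assumptions for illustration. Unlike the method above, this variant closes the Scanner with try-with-resources, which also closes the stream passed in:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.NoSuchElementException;
import java.util.Scanner;

class ScannerLatin1Demo {
    // Reads the whole stream as one token: "\\A" matches only at the start
    // of the input, so the first token spans everything up to end of stream.
    static String convertStreamToString(InputStream is, String charsetName) {
        try (Scanner scanner = new Scanner(is, charsetName)) {
            return scanner.useDelimiter("\\A").next();
        } catch (NoSuchElementException e) {
            return ""; // empty stream: there is no token at all
        }
    }

    public static void main(String[] args) {
        // "café" encoded as Latin-1: the é is the single byte 0xE9
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        String message = convertStreamToString(new ByteArrayInputStream(latin1), "ISO-8859-1");
        System.out.println(message.equals("caf\u00e9"));
    }
}
```

"8859_1" and "ISO-8859-1" are aliases for the same charset, so either name can be passed.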
