
How to convert String encoded in windows-1250/Cp1250 to utf-8?

As the title says: I read content from an HTTP response.



    InputStream is = response.getEntity().getContent();
    String cw = IOUtils.toString(is);
    byte[] b = cw.getBytes("Cp1250");
    String x = StringUtils.newStringUtf8(b);
    String content = new String(b, "UTF-8");

    System.out.println(content);

I have tried plenty of variations. I am a little confused about which encoding names are correct when passed as strings: windows-1250 or Cp1250? UTF-8, utf-8, or utf8?

You seem to think that a String object has an encoding. That's not correct. An encoding is used as part of the translation from binary data (a byte[] or InputStream ) to text data (a String or char[] etc).

It's not clear what IOUtils.toString is doing, but it's almost certainly losing data or at least handling it inappropriately. If your data is originally in Windows-1250, then you should use an InputStreamReader wrapping the InputStream , specifying the charset in the InputStreamReader constructor call.
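As a minimal sketch of that suggestion, the stream can be decoded with an InputStreamReader that is told the charset explicitly. The class name ReadWindows1250 and the helper readAll are made up for this illustration, and a ByteArrayInputStream stands in for the HTTP response stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ReadWindows1250 {
    // Hypothetical helper: decodes the whole stream as windows-1250 text.
    static String readAll(InputStream in) {
        Charset cp1250 = Charset.forName("windows-1250");
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, cp1250)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x9A is the letter s with caron (U+0161) in windows-1250.
        // This array is a stand-in for response.getEntity().getContent().
        byte[] raw = {(byte) 0x9A, 'k', 'o', 'l', 'a'};
        String text = readAll(new ByteArrayInputStream(raw));
        System.out.println(text.length()); // 5 chars decoded from 5 bytes
    }
}
```

The resulting String has no encoding of its own any more; the charset only mattered while the bytes were being decoded.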

It's not clear where UTF-8 comes in - you might want to write out the data in UTF-8 afterwards, but the result of that would be byte[] , not a string.

You're converting backwards. You need to get the input data as a byte array and then use new String(byteArray, "Cp1250") to create the String object. Then, if you want UTF-8, use String.getBytes("UTF-8") .
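The two steps can be sketched as follows. The class name Cp1250ToUtf8 and the method transcode are hypothetical names for this example, and it assumes the raw bytes have already been read from the response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1250ToUtf8 {
    // Hypothetical helper: bytes in windows-1250 in, bytes in UTF-8 out.
    static byte[] transcode(byte[] cp1250Bytes) {
        // Step 1: decode the bytes into text using the encoding they are really in.
        String text = new String(cp1250Bytes, Charset.forName("windows-1250"));
        // Step 2: encode the text into the target encoding.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE8 is the letter c with caron (U+010D) in windows-1250.
        byte[] utf8 = transcode(new byte[] {(byte) 0xE8});
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Passing a Charset object instead of a name string avoids the checked UnsupportedEncodingException that the String-name overloads declare.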

An encoding has one canonical (unique) name plus other aliases, and the names are case-insensitive. For instance, "UTF-8" is the canonical name, but some Java versions back it was "UTF8"; the name was changed to match common usage. The same holds for "windows-1250", which you might also see in HTML pages. "Cp1250" (code page) is a Java-internal name.
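One way to check which names a given JVM accepts is to ask java.nio.charset.Charset directly. This is only a sketch; the class name CharsetNames and the method canonical are invented for the example, and it assumes the runtime ships the windows-1250 charset (a standard JDK does).

```java
import java.nio.charset.Charset;

class CharsetNames {
    // Hypothetical helper: resolves any accepted name or alias to the canonical name.
    static String canonical(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "utf-8", "utf8", "windows-1250", "Cp1250"};
        for (String name : names) {
            Charset cs = Charset.forName(name); // lookup is case-insensitive
            System.out.println(name + " -> " + cs.name() + " " + cs.aliases());
        }
    }
}
```

All five spellings resolve to one of two Charset objects, so for lookups the choice between them is a matter of taste.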

In Java, byte[] is binary data and String (internally Unicode) is text. Converting between the two needs an encoding. The encoding parameter is often optional, in which case the platform default charset is used.

byte, InputStream, OutputStream <-> String, char, Reader, Writer

String cw = IOUtils.toString(is, "UTF-8"); // an InputStream is binary (bytes), so the encoding must be given
byte[] b = cw.getBytes("Cp1250");          // text -> bytes encoded in Cp1250
String x = new String(b, "Cp1250");        // bytes encoded in Cp1250 -> text
String content = x;

System.out.println(content);

To make String universal with respect to encoding, String internally uses char (UTF-16). String constants are stored in the .class file in a modified UTF-8 (more compact).

Assuming Apache Commons IO, use one of the methods that specifies an encoding :

String cw = IOUtils.toString(is, "windows-1250");

All strings are implicitly UTF-16 in Java. Other encodings are generally represented using byte arrays.
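A small sketch may make the distinction concrete: the same String yields different byte arrays depending on the charset used to encode it. The class name EncodedLengths and the method encodedLength are hypothetical, chosen only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u010D"; // letter c with caron, a single UTF-16 char
        System.out.println("chars:        " + text.length());
        System.out.println("windows-1250: " + encodedLength(text, Charset.forName("windows-1250")));
        System.out.println("UTF-8:        " + encodedLength(text, StandardCharsets.UTF_8));
    }
}
```

The String itself never changes; only the byte representation chosen at encoding time does.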

I find it better to use a Scanner for reading in different charsets.

    FileInputStream is = new FileInputStream(fileOrPath);
    Scanner scanner = new Scanner(is, "cp1250");
    String out = scanner.next();

The next() method then returns the next token as an ordinary Java String, already decoded from the given charset.

Tested with Czech text, from "cp1250" to "UTF-8".
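Note that next() returns only one whitespace-delimited token. A possible variation, sketched here under the assumption that the whole input should be read at once, sets the delimiter to the start-of-input anchor so that a single next() call yields everything. The class name ScannerCp1250 and the method readAll are made up for this example, and a ByteArrayInputStream replaces the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

class ScannerCp1250 {
    // Hypothetical helper: reads the entire stream as windows-1250 text.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, "windows-1250")) {
            // "\\A" matches only at the beginning of input, so the
            // first token spans the whole stream.
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "dobry den" with 0xFD, which is y with acute (U+00FD) in windows-1250.
        byte[] raw = {'d', 'o', 'b', 'r', (byte) 0xFD, ' ', 'd', 'e', 'n'};
        String text = readAll(new ByteArrayInputStream(raw));
        System.out.println(text.length()); // 9 chars, including the space
    }
}
```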
