Dealing with incorrectly encoded UTF-16 (?) in Java

Original post 2011-11-27 00:23:49 6 1 java/ string/ utf-8/ character-encoding

I'm doing some work on the Common Crawl dataset (a large web crawl) and I keep seeing a strange encoding scheme I just can't work out how to deal with.

The pattern I'm seeing again and again is something like the sequence of bytes 50 6f 6b e9 6d 6f 6e, which I'm guessing is meant to represent Pokémon.

Now encoding schemes aren't my strongest point, but I don't know of any encoding where it's valid to represent the é as just e9.

It's a bit like [UTF-16][1], which would be fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e

And it's definitely not UTF-8, which would be 50 6f 6b c3 a9 6d 6f 6e
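For anyone who wants to check the two byte sequences above, here is a minimal sketch that encodes the string and prints the bytes as hex. The class name `PokemonBytes` and the helpers `hex` and `encodedHex` are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PokemonBytes {
    // Hypothetical helper: renders bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the bytes as hex.
    static String encodedHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String text = "Pok\u00e9mon"; // \u00e9 is the letter é (U+00E9)
        System.out.println(encodedHex(text, StandardCharsets.UTF_8));
        System.out.println(encodedHex(text, StandardCharsets.UTF_16));
    }
}
```

Java's `UTF_16` charset writes a big-endian byte order mark when encoding, which is where the leading fe ff comes from.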

So I'm just after a way in Java to decode these bytes into a string; a library would be ideal.

new String(bytes) justifiably doesn't work and is rightly converting the e9 to the replacement character ef bf bd (aka the dreaded �)
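A small sketch of that failure, assuming the platform default charset is UTF-8 (the code passes UTF-8 explicitly so the result does not depend on the machine). `ReplacementDemo`, `RAW` and `decodeUtf8` are illustrative names, not anything from the crawl tooling.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // The byte sequence from the question: "Pok", a lone 0xe9, then "mon".
    static final byte[] RAW = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    // Decodes as UTF-8; malformed input is silently replaced with U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeUtf8(RAW);
        // A lone 0xe9 is not a complete UTF-8 sequence, so it becomes U+FFFD.
        System.out.println(decoded.indexOf('\uFFFD'));
        // Re-encoding U+FFFD as UTF-8 gives the three bytes ef bf bd.
        for (byte b : "\uFFFD".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

The original e9 is lost once this happens, so the decode has to be redone from the raw bytes rather than repaired on the string.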

Any ideas on how to handle these?

Update

I've ended up using the character set encoding detector provided in Apache Tika [2]. Works well.
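If pulling in a detector is too heavy, a much cruder stdlib-only fallback is sometimes enough. The sketch below is not Tika's API and does no statistical detection: it tries strict UTF-8 first and, if the bytes are malformed, assumes windows-1252. Both the class name `FallbackDecoder` and the choice of windows-1252 as the fallback are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {
    // Assumption: non-UTF-8 pages in the data are mostly windows-1252.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to the assumed legacy encoding.
            return new String(bytes, FALLBACK);
        }
    }

    public static void main(String[] args) {
        byte[] legacy = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};
        byte[] utf8 = "Pok\u00e9mon".getBytes(StandardCharsets.UTF_8);
        // Both inputs should decode to the same seven-character string.
        System.out.println(decode(legacy).equals(decode(utf8)));
    }
}
```

This only distinguishes "valid UTF-8" from "everything else", so a real detector is still the better choice for a crawl with many different encodings.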

[1] http://www.fileformat.info/info/unicode/char/e9/index.htm

[2] http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html

1 answer

That's either ISO-8859-1 or Windows-1252, the latter being essentially a superset of the former. Use either new String(bytes, "ISO-8859-1") or new String(bytes, "Windows-1252").
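A minimal sketch of both calls on the bytes from the question. The class and method names are illustrative; `StandardCharsets.ISO_8859_1` is guaranteed by the JDK, while `windows-1252` is assumed to be available on the running JVM (it normally is, but it is not one of the guaranteed standard charsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Decode {
    // The byte sequence from the question.
    static final byte[] RAW = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeCp1252(byte[] bytes) {
        // Assumption: the JVM ships the windows-1252 charset.
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // For these bytes both charsets give the same string.
        System.out.println(decodeLatin1(RAW).equals(decodeCp1252(RAW)));

        // They differ in the 0x80-0x9f range: 0x80 is a control
        // character in ISO-8859-1 but the euro sign in windows-1252.
        byte[] b80 = {(byte) 0x80};
        System.out.println((int) decodeLatin1(b80).charAt(0));
        System.out.println((int) decodeCp1252(b80).charAt(0));
    }
}
```

Passing a `Charset` object instead of a charset name also avoids the checked `UnsupportedEncodingException` that the string-name constructor declares.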

How to validate and parse an UTF-16 encoded XML file in Java?

Extracting UTF-16 encoded file from ZIP archive in Java

Can any character be encoded in UTF-16 (using Java 8)

UTF-16 to String in Java

Why don't UTF-8 and UTF-16 encoded Strings print the same in Java?

Java UTF-16 conversion to UTF-8

UTF-8 and UTF-16 in Java

Convert a UTF-32 encoded string (C style) in a UTF-16 (JSON style) encoded one in Java/Clojure

UTF-16 Character Encoding of java

java encode charset on UTF-16

solution1 7 ACCEPTED 2011-11-27 00:28:28
