
Dealing with incorrectly encoded UTF-16 (?) in Java

I'm doing some work on the Common Crawl dataset (a large web crawl) and I keep seeing a strange encoding scheme that I just can't work out how to deal with.

The pattern I'm seeing again and again is something like the byte sequence 50 6f 6b e9 6d 6f 6e, which I'm guessing is meant to represent Pokémon.

Now encoding schemes aren't my strongest point, but I don't know of any encoding where it's valid to represent the é as just e9.

It's a bit like [UTF-16][1], which (big-endian, with a byte order mark) would be fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e.

And it's definitely not UTF-8, which would be 50 6f 6b c3 a9 6d 6f 6e.

So I'm just after a way in Java to decode these bytes into a string; a library would be ideal.

new String(bytes) understandably doesn't work: it rightly converts the e9 to the replacement character ef bf bd (aka the dreaded �).
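For reference, the behaviour described above can be reproduced with a small sketch. It assumes the platform default charset is UTF-8, which is what new String(bytes) would then use; the class and method names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decode explicitly as UTF-8. This mirrors new String(bytes) only on
    // platforms whose default charset is UTF-8 (an assumption of this sketch).
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte sequence from the question: 50 6f 6b e9 6d 6f 6e
        byte[] bytes = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};
        String decoded = decodeAsUtf8(bytes);

        // A lone e9 is not valid UTF-8, so the decoder substitutes U+FFFD,
        // which re-encodes to ef bf bd in UTF-8.
        boolean replaced = decoded.indexOf('\uFFFD') >= 0;
        System.out.println("contains U+FFFD: " + replaced);
    }
}
```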

Any ideas on how to handle these?

Update

I've ended up using the character set encoding detector provided in Apache Tika [2]. Works well.

[1] http://www.fileformat.info/info/unicode/char/e9/index.htm

[2] http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html

That's either ISO-8859-1 or Windows-1252, the latter being essentially a superset of the former. In both, the single byte e9 is é. Use either new String(bytes, "ISO-8859-1") or new String(bytes, "Windows-1252").
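A minimal sketch of that suggestion, using Charset objects rather than charset-name strings so no checked UnsupportedEncodingException has to be handled. The class and method names are illustrative only. It also shows where the two encodings differ: bytes 80 to 9f are control characters in ISO-8859-1 but mostly printable characters (such as the euro sign at 80) in Windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Decode {
    // ISO-8859-1 maps every byte 00..ff directly to U+0000..U+00FF.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Windows-1252 agrees with ISO-8859-1 except in the 80..9f range.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The byte sequence from the question: 50 6f 6b e9 6d 6f 6e
        byte[] bytes = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

        // Both charsets decode e9 as U+00E9 (é).
        System.out.println(decodeLatin1(bytes).equals("Pok\u00e9mon"));
        System.out.println(decodeCp1252(bytes).equals("Pok\u00e9mon"));

        // The two differ for byte 80: a C1 control character vs. the euro sign.
        byte[] euro = {(byte) 0x80};
        System.out.println(decodeLatin1(euro).equals("\u0080"));
        System.out.println(decodeCp1252(euro).equals("\u20ac"));
    }
}
```

If the crawl data mixes encodings, Windows-1252 is probably the safer default of the two for web text, since pages labelled ISO-8859-1 often contain Windows-1252 punctuation; that is a judgment call rather than something the bytes alone can settle.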
