
Is there a way to find a file's encoding (UTF-8, ANSI, Cp1252, or others) using Java?

I have to read a few HTML files. If I use UTF-8 as the charset for reading and writing a file, some junk characters are displayed in the HTML page. It seems the actual file is ANSI encoded: because I am using UTF-8 to read and write it, a few white spaces are displayed as a black diamond with a question mark.

Is there a way to find the encoding/charset to be used to read/write a particular file?

No, that's mathematically impossible. Files are just bags of bytes, and in most encodings every byte value has a meaning, so decoding with the wrong one still "works" and simply yields the wrong characters. Short of using an artificial intelligence getup that analyses how likely it is that you read the file using the right encoding (looking for words that mix characters from unrelated Unicode blocks and the like), there is therefore no way to be sure.

Some files can be conclusively determined to not be UTF-8 (or to be corrupted), because certain byte sequences can never appear in the bytestream that results from UTF-8 encoding text. However, this isn't very useful either: you cannot conclude "Oh, must be UTF-8!" from the mere absence of these invalid sequences.
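The "definitely not UTF-8" half of that can be checked mechanically. Below is a minimal sketch using the JDK's strict decoder; the class and method names (Utf8Check, couldBeUtf8) are made up for illustration. Note the asymmetry: false is conclusive, true only means "no illegal sequence was found".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns false if the bytes contain a sequence that is illegal in UTF-8.
    // Returns true if they decode cleanly, which means "could be UTF-8",
    // not "is UTF-8": pure ASCII, for instance, always passes.
    static boolean couldBeUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of inserting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java Utf8Check page.html
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(couldBeUtf8(data)
                ? "no invalid UTF-8 sequences found (proves nothing)"
                : "definitely not valid UTF-8");
    }
}
```

A file saved as Cp1252 that contains an ß followed by an ordinary letter fails this check, because 0xDF is a UTF-8 lead byte that must be followed by a continuation byte.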

You have some options

The right way

When those HTML files were saved, the encoding was either chosen or known. It was chosen if a browser saved them: the HTML was received from the webserver, decoded from bytes to chars using the charset listed in the HTTP response header 'Content-Type', and loaded into browser memory, so when you asked the browser to save it to a file, the browser had to pick an encoding. It was known if a tool saved the HTML 'raw', straight as it was sent over the HTTP connection: that tool had the encoding available, because the HTTP server sent it in the 'Content-Type' header. Either way, that was the perfect time to store this information, or to choose a well-known encoding (UTF-8 is a good idea).

So, go back to whichever software and/or process saved these files and fix it at the source: either save the encoding along with the file, or ensure that the HTML file is saved in UTF-8 no matter what encoding the HTTP server sent it in.
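As a rough sketch of the "always save as UTF-8" option: assuming the saving tool has the raw response body and the value of the Content-Type header in hand, it can transcode before writing. The names SaveAsUtf8, charsetOf and toUtf8 are hypothetical, and falling back to ISO-8859-1 when the header names no usable charset is an assumption of this sketch, not a rule.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SaveAsUtf8 {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=windows-1252". Returns the fallback if the
    // parameter is missing or names a charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return fallback;
    }

    // Decodes the body with the declared charset and re-encodes it as UTF-8.
    // The ISO-8859-1 fallback is an assumption made for this sketch.
    static byte[] toUtf8(byte[] body, String contentType) {
        Charset source = charsetOf(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, source).getBytes(StandardCharsets.UTF_8);
    }
}
```

If the HTML itself declares an encoding in a meta tag, that declaration would also need updating after transcoding; the sketch does not handle that.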

The hacky way

Grab a magnifying glass, put on your finest hat, and get your Sherlock Holmes on.

The usual strategy is to open a hex editor, travel to the position in the file where you see diamonds or unexpected characters, and look at the byte sequence there. Especially if it is a somewhat 'well known' western non-ASCII character like é or ö, a web search for the byte(s) you see will usually turn up what they are. Look for the bytes with decimal value 128 or higher (in hex, the ones that start with 8, 9, or a letter): the ones below that are ASCII, and almost all encodings encode those the same way, so they are not useful for telling encodings apart.
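If you'd rather not scroll through a hex editor, a few lines of Java can list the interesting bytes for you. This is only a sketch; NonAsciiFinder and nonAsciiBytes are made-up names.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NonAsciiFinder {
    // Lists every byte with value 128 or higher as "offset: 0xHH".
    static List<String> nonAsciiBytes(byte[] data) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF; // Java bytes are signed; mask to get 0..255
            if (b >= 0x80) {
                hits.add(String.format("%d: 0x%02X", i, b));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java NonAsciiFinder page.html
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        nonAsciiBytes(data).forEach(System.out::println);
    }
}
```

Runs of two or three adjacent hits hint at a multi-byte encoding such as UTF-8; isolated single hits hint at a single-byte encoding. That is a hint only, not proof.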

For example, if you search for 0xC3 0x9F, you will find a table of UTF-8 byte sequences that says: that's the UTF-8 encoding of code point U+00DF, the sharp s (ß, common in German). If that makes sense in the text, you have figured it out. Deciding whether it makes sense takes an AI doing text analysis. I don't have one, so we'll need an artificial artificial intelligence. In other words, your brain will have to do the job. Just look at it: if, after substituting an ß, the text says Last Name: Boßler, you obviously got it. Boßler is a German last name, as well as a mountain in Germany. Web searching comes to the rescue again if you are not sure.

Sometimes you have to figure out what character it was supposed to be, and include this in the search. For example, if you check the file, see a 0xDF, and know an ß has to be there, search for 0xDF ß and you will find a page that shows a ton of encodings and how they store ß. Only some store it as 0xDF: it's some ISO-8859 variant or a Cp125x variant (aka windows-125x), and you've managed to exclude IBM852. There's no way to know which ISO-8859 or Cp125x variant it actually is; you'll need more weird characters, and you have to hope you hit one where you know what it is supposed to be and which is encoded differently between those variants (unlikely; they are very similar).
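The same elimination game can be played in Java instead of a web search: decode the suspicious byte(s) with each candidate charset and see which results match the character you expect. A minimal sketch follows; CandidateDecoder and decodeWith are made-up names, and the list of candidates is just an example. Charsets the running JVM does not ship are skipped.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class CandidateDecoder {
    // Decodes the same bytes with each named charset, in the order given.
    // Names not supported by this JVM are silently left out of the result.
    static Map<String, String> decodeWith(byte[] bytes, String... charsetNames) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                out.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] suspicious = {(byte) 0xDF};
        decodeWith(suspicious, "ISO-8859-1", "ISO-8859-15", "windows-1252", "IBM852", "UTF-8")
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Every candidate that yields ß stays on the list. Candidates that yield something else, or the replacement character U+FFFD (as UTF-8 does for a lone 0xDF), are out.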

Most likely you will end up knowing only that it is one of a few encodings, because usually multiple encodings produce the exact same byte sequence for a given text. In fact, if the file contains only ASCII characters, a great many encodings fit equally well.
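To get a feel for how ambiguous an all-ASCII file is, you can ask the JVM which of its installed charsets decode a given byte sequence to the same text. Again a sketch with made-up names (AsciiAmbiguity, charsetsDecodingTo); the count depends on which charsets your JVM ships.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AsciiAmbiguity {
    // Returns the canonical names of all installed charsets that decode
    // the given bytes to exactly the expected text.
    static List<String> charsetsDecodingTo(byte[] bytes, String expected) {
        List<String> names = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            try {
                if (expected.equals(new String(bytes, cs))) {
                    names.add(cs.name());
                }
            } catch (RuntimeException e) {
                // defensive: skip any charset that fails to decode
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "Last Name: Smith";
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        List<String> matches = charsetsDecodingTo(bytes, text);
        System.out.println(matches.size() + " charsets fit: " + matches);
    }
}
```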
