
Reading lines of text in unknown encoding

I need to read a text file line by line and apply several CharsetDecoders to each line, in order. Specifically, I first try to decode the line as UTF-8, and fall back to a one-byte charset if the UTF-8 CharsetDecoder throws MalformedInputException.
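To make the intended decoding step concrete, here is a minimal sketch of the "strict UTF-8 first, one-byte fallback second" idea, applied to the raw bytes of a single line. The class name `FallbackDecoder`, the method `decodeLine`, and the choice of ISO-8859-1 as the fallback in the demo are assumptions made for illustration; they do not come from the original post.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {

    // Hypothetical helper: try strict UTF-8, then fall back to a one-byte charset.
    static String decodeLine(byte[] raw, Charset fallback) {
        // REPORT makes the decoder throw instead of substituting a replacement character.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        // "h\u00e9llo" encoded as valid UTF-8.
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // The same kind of text in ISO-8859-1: 0xE9 followed by 't' is malformed UTF-8.
        byte[] invalid = {(byte) 0xE9, 't', (byte) 0xE9};

        System.out.println(decodeLine(valid, StandardCharsets.ISO_8859_1));
        System.out.println(decodeLine(invalid, StandardCharsets.ISO_8859_1));
    }
}
```

A new decoder is created per call to keep the sketch simple; in a tight loop it would probably be worth reusing one decoder and calling `reset()` between lines.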

However, if I use an InputStreamReader with the default or a specified charset, readLine silently replaces every byte it considers invalid for that charset with '?'.

I ended up writing my own line-reading function, which reads from a stream byte by byte, looks for line terminators, and assembles the lines. But done this way it is terribly slow.

Is there any way to make Java read lines without touching the bytes?

UPDATE: I've found out that there are charsets in which all 256 byte values are valid, two of which (CR and LF) are line terminators. So it is possible to read a raw byte stream line by line by decoding it with such a charset. Examples of such charsets are:

IBM00858 IBM437 IBM775 IBM850 IBM852 IBM855 IBM860 IBM861 IBM862 IBM863 IBM865 IBM866 ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-4 ISO-8859-5 ISO-8859-9 KOI8-R KOI8-U windows-1256
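The trick from the update can be sketched roughly as follows, using ISO-8859-1 from the list above: decode the stream with a charset in which every byte is valid, let BufferedReader split the lines, then encode each line back with the same charset to recover its original bytes. The class name `RawLineReader` and the method `readRawLines` are hypothetical names chosen for this example. It also assumes the real encodings are ASCII-compatible with respect to CR and LF (true for UTF-8 and the one-byte charsets listed, not for UTF-16).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawLineReader {

    // Hypothetical helper: returns the raw bytes of each line, without terminators.
    static List<byte[]> readRawLines(InputStream in) throws IOException {
        List<byte[]> lines = new ArrayList<>();
        // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value,
        // so nothing is replaced during decoding.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            // Encoding back with the same charset restores the original bytes.
            lines.add(line.getBytes(StandardCharsets.ISO_8859_1));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Line 1 is UTF-8 (0xC3 0xA9), line 2 is a lone 0xE9, line 3 is plain ASCII.
        byte[] data = {(byte) 0xC3, (byte) 0xA9, '\n', (byte) 0xE9, '\r', '\n', 'x'};
        for (byte[] raw : readRawLines(new ByteArrayInputStream(data))) {
            System.out.println(raw.length + " byte(s)");
        }
    }
}
```

Each returned byte array can then be handed to whatever chain of strict decoders is needed, and the buffered reader avoids the cost of reading the stream one byte at a time.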

The question is now closed.

You can't use a reader class and expect it not to decode the underlying byte stream. If you have a file where each line is encoded in a different charset (?), then you'd be better off devising a method of detecting the underlying character encoding. Perhaps you can use an encoding detector such as juniversalchardet.
