
Reading lines of text in unknown encoding

Original question asked 2011-07-06 04:11:25 · Tags: java, character-encoding, decoding

I need to read a text file line by line and apply several CharsetDecoders to each line, in order. Specifically, I first try to decode the line as UTF-8, and fall back to a single-byte charset if the UTF-8 CharsetDecoder raises a MalformedInputException.
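The decode-then-fall-back step described above can be sketched as follows. This is only a minimal illustration: the class name `FallbackDecoder` and the helper `decodeLine` are hypothetical, and it assumes the bytes of one line have already been obtained somehow.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FallbackDecoder {
    // Hypothetical helper: try strict UTF-8 first, then a single-byte fallback charset.
    static String decodeLine(byte[] lineBytes, Charset fallback) {
        // REPORT makes the decoder throw instead of substituting a replacement character.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(lineBytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: reinterpret the same bytes with the fallback charset.
            return new String(lineBytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeLine(utf8Bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decodeLine(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

A fresh decoder is created per call to keep the sketch simple; a decoder could also be reused, since the `decode(ByteBuffer)` convenience method resets it on each invocation.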

However, if I use an InputStreamReader with the default or a specified charset, readLine silently replaces with '?' every byte it considers invalid for that charset.
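For comparison, an InputStreamReader can also be constructed from a CharsetDecoder rather than a charset, and a decoder configured with `CodingErrorAction.REPORT` throws instead of substituting. The sketch below (the class `StrictReaderDemo` and the helper `countLinesStrict` are hypothetical names) only shows that the error surfaces. Because the reader decodes a whole buffer at a time, the exception is not tied to a particular line, so this alone does not give a per-line fallback.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

class StrictReaderDemo {
    // Hypothetical helper: counts lines, failing on the first invalid UTF-8 sequence.
    static int countLinesStrict(InputStream in) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // The reader uses the decoder as configured, so malformed input is reported.
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, strict));
        int count = 0;
        while (reader.readLine() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] bad = {'o', 'k', '\n', (byte) 0xFF, '\n'}; // 0xFF is never valid in UTF-8
        try {
            countLinesStrict(new ByteArrayInputStream(bad));
            System.out.println("no error");
        } catch (MalformedInputException e) {
            System.out.println("malformed input reported");
        }
    }
}
```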

I ended up writing my own line-reading function, which reads from the stream byte by byte, looks for line terminators, and builds the lines. But this approach turns out to be terribly slow.
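A byte-by-byte line reader of this kind might look like the sketch below. The original function is not shown in the question, so the class `RawLineReader` and the helper `readRawLine` are hypothetical. One possible cause of the slowness, offered here only as an assumption, is calling `read()` on an unbuffered stream; wrapping the stream in a BufferedInputStream keeps the single-byte reads in memory.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RawLineReader {
    // Returns the next line as raw bytes without its terminator, or null at end of stream.
    // Assumption of this sketch: lines end with \n or \r\n; a lone \r is not a terminator.
    static byte[] readRawLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            line.write(b);
            b = in.read();
        }
        byte[] bytes = line.toByteArray();
        // Drop the \r of a \r\n terminator.
        if (b == '\n' && bytes.length > 0 && bytes[bytes.length - 1] == '\r') {
            return Arrays.copyOf(bytes, bytes.length - 1);
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first\r\nsecond\n".getBytes(StandardCharsets.US_ASCII);
        // The buffer is what makes the per-byte read() calls cheap.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            byte[] line;
            while ((line = readRawLine(in)) != null) {
                System.out.println(line.length + " bytes");
            }
        }
    }
}
```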

Is there any way to make Java read lines without touching the bytes?

UPDATE: I've found out that there are charsets in which all 256 byte values are valid, two of them being line terminators. So it is possible to read the raw byte stream line by line. Examples of such charsets are:

IBM00858 IBM437 IBM775 IBM850 IBM852 IBM855 IBM860 IBM861 IBM862 IBM863 IBM865 IBM866 ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-4 ISO-8859-5 ISO-8859-9 KOI8-R KOI8-U windows-1256
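One way to use this observation, sketched here under assumptions, is to read lines with ISO-8859-1, re-encode each line with the same charset to get back the original bytes, and then run the real decoders on those bytes. The class `Latin1RoundTrip` and its helpers are hypothetical names. The sketch assumes an ASCII-compatible file, where the bytes 0x0A and 0x0D occur only as line terminators (true for UTF-8 and single-byte charsets, not for UTF-16).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Latin1RoundTrip {
    // Strict UTF-8 first, then the fallback charset (newDecoder() reports errors by default).
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    static List<String> readLines(InputStream in, Charset fallback) throws IOException {
        // ISO-8859-1 maps every byte value to one char, so no byte is replaced or lost.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        List<String> lines = new ArrayList<>();
        String raw;
        while ((raw = reader.readLine()) != null) {
            // Re-encoding with the same charset recovers the original bytes of the line.
            byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
            lines.add(decode(bytes, fallback));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // First line is UTF-8, second line is ISO-8859-1.
        byte[] mixed = {0x68, (byte) 0xC3, (byte) 0xA9, '\n', 0x68, (byte) 0xE9, '\n'};
        for (String line : readLines(new ByteArrayInputStream(mixed), StandardCharsets.ISO_8859_1)) {
            System.out.println(line);
        }
    }
}
```

Note that `readLine` treats `\n`, `\r`, and `\r\n` as terminators, so this approach relies on BufferedReader's buffering rather than on per-byte reads.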

The question is now closed.

1 answer

You can't use a reader class and expect it not to decode the underlying byte stream. If you have a file where each line is encoded in a different charset (?), then you'd be better off devising a method of detecting the underlying character encoding. Perhaps you can use an encoding detector such as juniversalchardet.