
How to read a UTF-8 encoded file using RandomAccessFile?

I have a text file that was encoded with UTF-8 (for language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.

I want to read line by line.

String str = myreader.readLine(); // returns wrong text, not decoded
String str = myreader.readUTF();  // throws java.io.EOFException

You can convert the string returned by readLine() to UTF-8 using the following code:

public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("MyFile.txt"), "r");
    String line = raf.readLine();
    String utf8 = new String(line.getBytes("ISO-8859-1"), "UTF-8");
    System.out.println("Line: " + line);
    System.out.println("UTF8: " + utf8);
}

Content of MyFile.txt (UTF-8 encoding):

Привет из Украины

Console output:

Line: ÐÑÐ¸Ð²ÐµÑ Ð¸Ð· УкÑаинÑ
UTF8: Привет из Украины
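To see why the conversion works without needing a file, here is a minimal in-memory sketch. The class and method names are made up for illustration, and bytesToCharsLikeReadLine only imitates the documented byte-to-char behaviour of readLine(); it is not the library's code. The Cyrillic text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Imitates readLine(): one char per byte, high eight bits set to zero.
    static String bytesToCharsLikeReadLine(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Recovers the original bytes via ISO-8859-1, then decodes them as UTF-8.
    static String fix(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // the word "Привет"
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        String wrong = bytesToCharsLikeReadLine(raw);
        System.out.println(wrong.length());           // 12: one char per UTF-8 byte
        System.out.println(fix(wrong).equals(text));  // true
    }
}
```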

The API docs say the following for readUTF():

Reads in a string from this file. The string has been encoded using a modified UTF-8 format.

The first two bytes are read, starting from the current file pointer, as if by readUnsignedShort. This value gives the number of following bytes that are in the encoded string, not the length of the resulting string. The following bytes are then interpreted as bytes encoding characters in the modified UTF-8 format and are converted into characters.

This method blocks until all the bytes are read, the end of the stream is detected, or an exception is thrown.

Is your string formatted in this way?

This appears to explain your EOF exception.
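As a rough illustration of that explanation, the sketch below interprets the first two bytes of plain UTF-8 text the way readUTF() would, and compares that with bytes actually produced by writeUTF(). The class and helper names are hypothetical; only DataOutputStream.writeUTF is the real API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadUtfPrefix {
    // Reads the first two bytes as a big-endian unsigned 16-bit length,
    // which is how readUTF() interprets them.
    static int lengthPrefix(byte[] data) {
        return ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
    }

    // Returns the bytes writeUTF() emits: 2-byte length, then modified UTF-8.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Plain UTF-8 text starting with U+041F, whose encoding is D0 9F.
        byte[] plain = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        System.out.println(lengthPrefix(plain)); // 53407, far more than the 12 bytes available

        byte[] framed = writeUtfBytes("Luke\n");
        System.out.println(lengthPrefix(framed)); // 5
        System.out.println(framed.length);        // 7
    }
}
```

With a plain text file, the bogus "length" usually exceeds what is left in the file, so readUTF() runs off the end and throws EOFException.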

Your file is a text file, so your actual problem is the decoding.

The simplest answer I know is:

try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"),"UTF-8"))){

    String line = null;
    while( (line = reader.readLine()) != null){
        if(line.equals("Obi-wan")){
            System.out.println("Yay, I found " + line +"!");
        }
    }
}catch(IOException e){
    e.printStackTrace();
}

Or you can set the current system encoding to UTF-8 with the system property file.encoding.

java -Dfile.encoding=UTF-8 com.jediacademy.Runner arg1 arg2 ...

You may also set it as a system property at runtime with System.setProperty(...) if you only need it for this specific file, but in a case like this I think I would prefer the InputStreamReader. (Note that the JVM typically reads file.encoding once at startup, so setting it at runtime may have no effect on the default charset.)

By setting the system property you can use FileReader and expect that it will use UTF-8 as the default encoding, in this case for all the files that you read and write.

If you intend to detect decoding errors in your file, you would be forced to use the InputStreamReader approach and use the constructor that receives a decoder.

Something like:

CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("jedis.txt"), decoder));

You may choose between the actions IGNORE | REPLACE | REPORT.
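To make the REPORT action concrete, here is a small sketch that checks a byte array instead of a file, so it can be run as is. The class and method names are hypothetical; the decoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Returns true if the bytes decode as UTF-8 with no malformed or unmappable input.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // With REPORT, bad input surfaces as an exception instead of being hidden.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("Yoda".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xFF}));                // false
    }
}
```

With IGNORE the bad bytes would be dropped silently, and with REPLACE they would be replaced by the decoder's replacement string.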

EDIT

If you insist on using RandomAccessFile, you would need to know the exact offset of the line that you intend to read. And not only that: in order to read with the readUTF() method, you should have written the file with the writeUTF() method, because this method, as the JavaDocs quoted above state, expects a specific format in which the first 2 bytes hold an unsigned value giving the length in bytes of the encoded string.

As such, if you do:

try(RandomAccessFile raf = new RandomAccessFile("jedis.bin", "rw")){

    raf.writeUTF("Luke\n"); //2 bytes for length + 5 bytes
    raf.writeUTF("Obiwan\n"); //2 bytes for length + 7 bytes
    raf.writeUTF("Yoda\n"); //2 bytes for length + 5 bytes

}catch(IOException e){
    e.printStackTrace();
}

You should not have any problems reading back from this file using the method readUTF(), as long as you can determine the offset of the line that you want to read back.

If you open the file jedis.bin you will notice it is a binary file, not a text file.

Now, I know that "Luke\n" is 5 bytes in UTF-8 and "Obiwan\n" is 7 bytes in UTF-8, and that the writeUTF() method will insert 2 bytes in front of every one of these strings. Therefore, before "Yoda\n" there are (5+2) + (7+2) = 16 bytes.
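The same arithmetic can be done in code instead of by hand. This is only a sketch with hypothetical names (encodedSize, offsetOf); it measures each record by letting DataOutputStream.writeUTF encode it into memory, so the 2-byte prefix is included.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WriteUtfOffsets {
    // Number of bytes writeUTF() emits for s: 2-byte length plus the encoded payload.
    static int encodedSize(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Byte offset at which the record with the given index starts.
    static long offsetOf(String[] records, int index) {
        long offset = 0;
        for (int i = 0; i < index; i++) {
            offset += encodedSize(records[i]);
        }
        return offset;
    }

    public static void main(String[] args) {
        String[] records = {"Luke\n", "Obiwan\n", "Yoda\n"};
        System.out.println(offsetOf(records, 2)); // 16
    }
}
```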

So, I could do something like this to reach the last line:

try (RandomAccessFile raf = new RandomAccessFile("jedis.bin", "r")) {

    raf.seek(16);
    String val = raf.readUTF();
    System.out.println(val); //prints Yoda

} catch (IOException e) {
    e.printStackTrace();
}

But this will not work if you wrote the file with a Writer class, because writers do not follow the formatting rules of the writeUTF() method.

In a case like this, the best option would be to format your binary file so that all strings occupy the same amount of space (number of bytes, not number of characters, because the number of bytes in UTF-8 varies with the characters in your String). If a string does not need all the space, you pad it.

That way you could easily calculate the offset of a given line, because they would all occupy the same amount of space.
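A minimal sketch of that idea follows. Everything here is an assumption for illustration: the 32-byte record size, zero bytes as padding (so the strings themselves must not contain U+0000), and the helper names.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedWidthRecords {
    static final int RECORD_SIZE = 32; // hypothetical size in bytes, padding included

    // Encodes s as UTF-8 and pads it with zero bytes up to RECORD_SIZE.
    static byte[] toRecord(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > RECORD_SIZE) {
            throw new IllegalArgumentException("record too long: " + utf8.length + " bytes");
        }
        return Arrays.copyOf(utf8, RECORD_SIZE);
    }

    // Every record starts at index * RECORD_SIZE, so no index file is needed.
    static String readRecord(RandomAccessFile raf, long index) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        raf.seek(index * RECORD_SIZE);
        raf.readFully(record);
        int len = 0;
        while (len < RECORD_SIZE && record[len] != 0) {
            len++; // stop at the first padding byte
        }
        return new String(record, 0, len, StandardCharsets.UTF_8);
    }

    // Writes the values to a temporary file and reads one of them back.
    static String writeThenRead(String[] values, int index) {
        try {
            File file = File.createTempFile("records", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                for (String v : values) {
                    raf.write(toRecord(v));
                }
                return readRecord(raf, index);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] jedis = {"Luke", "Obi-wan", "Yoda"};
        System.out.println(writeThenRead(jedis, 2)); // Yoda
    }
}
```

The price is wasted space and a hard upper limit on record length, which is the usual trade-off of fixed-width formats.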

You aren't going to be able to go at it this way. The seek function positions you at some number of bytes. There is no guarantee that you are aligned to a UTF-8 character boundary.
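That said, UTF-8 is self-synchronizing: continuation bytes always have the bit pattern 10xxxxxx, so after landing at an arbitrary position you can skip forward to the next byte that starts a character. A rough sketch on a byte array, assuming the data is valid UTF-8 (the names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Continuation bytes of a multi-byte UTF-8 sequence look like 10xxxxxx.
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the first index >= pos at which a character starts.
    static int nextCharStart(byte[] data, int pos) {
        while (pos < data.length && isContinuationByte(data[pos])) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Three Cyrillic letters, two bytes each.
        byte[] data = "\u041f\u0440\u0438".getBytes(StandardCharsets.UTF_8);
        System.out.println(nextCharStart(data, 1)); // 2: index 1 is inside the first letter
        System.out.println(nextCharStart(data, 2)); // 2: already on a boundary
    }
}
```

This only finds a character boundary, not the start of a line, so it does not remove the need for an index if you want line-level access.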

Once you are positioned on a given line (this means you have answered the first part of your problem, see @martinjs's answer), you can read the whole line and make a String out of it using the statement given in the answer by @Matthieu. But to check whether that statement is correct, we have to ask ourselves 4 questions. It is not self-evident.

Note that getting to the start of a line may require analyzing the text to build an index if you need to access many lines randomly and quickly.

The statement to read a line and turn it into a String is:

String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");
  1. What is a byte in UTF-8? That is, which values are allowed? We'll see that this question is in fact moot once we answer question 2.
  2. readLine(). UTF-8 bytes → UTF-16 chars: ok? Yes. UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0, and readLine() guarantees that the MSB is 0.
  3. getBytes("ISO-8859-1"). Characters encoded in UTF-16 (a Java String with 1 or 2 char code units per character) → ISO-8859-1 bytes: ok? Yes. The code points of the characters in the Java string are ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it encodes each of those characters as a single byte with the same value.
  4. new String(..., "UTF-8"). ISO-8859-1 bytes → UTF-8 bytes: ok? Yes. Since the original bytes come from UTF-8 encoded text and have been extracted as is, they still represent text encoded in UTF-8.

Concerning the raw nature of ISO-8859-1, in which every byte (value 0 to 255) is mapped onto a character, I copy/paste below the comment I made on the answer by @Matthieu.

See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC1345 and see that control codes C0 and C1 are mapped onto the 65 unused bytes of ISO/IEC 8859-1.
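The "raw" property is easy to check for Java's own ISO-8859-1 charset. The sketch below (hypothetical class name) decodes all 256 byte values and encodes them again, verifying that each byte maps to the char with the same numeric value and back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1IsRaw {
    // Checks that byte value i decodes to char i, and that encoding restores the bytes.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes()); // true
    }
}
```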

I realise that this is an old question, but it still seems to attract some interest, and has no accepted answer.

What you are describing is essentially a data structures problem. The discussion of UTF8 here is a red herring: you would face the same problem using a fixed-length encoding such as ASCII, because you have variable-length lines. What you need is some kind of index.

If you absolutely can't change the file itself (the "string file"), as seems to be the case, you could always construct an external index. The first time (and only the first time) the string file is accessed, you read it all the way through (sequentially), recording the byte position of the start of every line, and finishing by recording the end-of-file position (to make life simpler). This can be achieved by the following code:

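List<Long> myList = new ArrayList<>(); // byte offsets; getFilePointer() returns a long
myList.add(0L); // assuming first string starts at beginning of file
while (myRandomAccessFile.readLine() != null) {
    // start of the next line; the final entry is the end-of-file position
    myList.add(myRandomAccessFile.getFilePointer());
}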

You then write these integers into a separate file ("index file"), which you will read back in every subsequent time you start your program and intend to access the string file. To access the nth string, pick the nth and (n+1)th entries from the index file (call these A and B). You then seek to position A in the string file and read B - A bytes, which you then decode from UTF8. For instance, to get line i:

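myRandomAccessFile.seek(myList.get(i));
// the offsets are long values, so the difference needs a cast to size the array
byte[] bytes = new byte[(int) (myList.get(i + 1) - myList.get(i))];
myRandomAccessFile.readFully(bytes);
String result = new String(bytes, "UTF-8"); // still includes the line terminator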
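Putting the two fragments together, a self-contained sketch could look like the following. The class and method names are hypothetical, the index is kept in memory rather than written to an index file, and the trailing line terminator is stripped, which the fragments above do not do.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // Records the byte offset of every line start, plus the end-of-file position.
    static List<Long> buildIndex(RandomAccessFile raf) throws IOException {
        List<Long> offsets = new ArrayList<>();
        raf.seek(0);
        offsets.add(0L);
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer());
        }
        return offsets;
    }

    // Reads line i as raw bytes and decodes it as UTF-8, without the terminator.
    static String readLineAt(RandomAccessFile raf, List<Long> offsets, int i) throws IOException {
        long start = offsets.get(i);
        byte[] bytes = new byte[(int) (offsets.get(i + 1) - start)];
        raf.seek(start);
        raf.readFully(bytes);
        int end = bytes.length;
        while (end > 0 && (bytes[end - 1] == '\n' || bytes[end - 1] == '\r')) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    // Writes content to a temporary file, indexes it, and returns the requested line.
    static String demo(String content, int lineNumber) {
        try {
            File file = File.createTempFile("lines", ".txt");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.write(content.getBytes(StandardCharsets.UTF_8));
                return readLineAt(raf, buildIndex(raf), lineNumber);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("Luke\nObi-wan\nYoda\n", 2)); // Yoda
    }
}
```

Note that readLine() on a RandomAccessFile is unbuffered, so building the index on a large file may be slow; that cost is paid only once.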

In many cases, however, it would be better to use a database such as SQLite, which creates and maintains the index for you. That way, you can add and modify extra "lines" without having to recreate the entire index. See https://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers for Java implementations.

Reading the file via readLine() worked for me:

RandomAccessFile raf = new RandomAccessFile( ... );
String line;
while ((line = raf.readLine()) != null) {
    // no charset argument: the bytes are decoded with the platform default charset
    String utf = new String(line.getBytes("ISO-8859-1"));
    ...
}

// my file content has been created with:
raf.write(myStringContent.getBytes()); // also uses the platform default charset

The readUTF() method of RandomAccessFile treats the first two bytes at the current file pointer as a length, then reads that many following bytes and returns them as a string.

For this method to work, the content should be written using the writeUTF() method, which uses the first two bytes at the current position to save the content size and then writes the content. Otherwise, most of the time you will get an EOFException.

See http://www.zoftino.com/java-random-access-files for details.

I find the API for RandomAccessFile challenging.

If your text is actually limited to code points 0-127 (the ASCII range, where each character is a single byte in UTF-8), then it is safe to use readLine(), but read its Javadoc carefully: that is one strange method. To quote:

This method successively reads bytes from the file, starting at the current file pointer, until it reaches a line terminator or the end of the file. Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

To read UTF-8 safely, I suggest you read (some or all of the) raw bytes with a combination of length() and read(byte[]). Then convert your UTF-8 bytes to a Java String with this constructor: new String(byte[], "UTF-8").

To write UTF-8 safely, first convert your Java String to the correct bytes with someText.getBytes("UTF-8"). Finally, write the bytes using write(byte[]).
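A short sketch of both directions follows. The names are hypothetical, the whole file is read in one go (so it must fit in a byte array), and readFully is used in place of read(byte[]) because read may return fewer bytes than requested.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class RafUtf8 {
    // Encodes the text explicitly as UTF-8 and writes the raw bytes.
    static void writeUtf8(RandomAccessFile raf, String text) throws IOException {
        raf.write(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file as raw bytes and decodes them as UTF-8.
    static String readAllUtf8(RandomAccessFile raf) throws IOException {
        byte[] bytes = new byte[(int) raf.length()];
        raf.seek(0);
        raf.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writes text to a temporary file and reads it back.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("utf8", ".txt");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                writeUtf8(raf, text);
                return readAllUtf8(raf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442\n";
        System.out.println(roundTrip(text).equals(text)); // true
    }
}
```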
