
Reading a character at a random place from a file in Java?

When I read from a file using readChar() in the RandomAccessFile class, I get unexpected output. Instead of the desired character, ? is displayed.

package tesr;

import java.io.IOException;
import java.io.RandomAccessFile;

public class Test {

    public static void main(String[] args) {
        // try-with-resources closes the file when the block exits
        try (RandomAccessFile f = new RandomAccessFile("c:\\ankit\\1.txt", "rw")) {
            f.seek(0);
            // readChar() consumes two bytes and returns them as a single char
            System.out.println(f.readChar());
        }
        catch (IOException e) {
            System.out.println("dkndknf");
        }
    }

}

You probably intended readByte. A Java char is a UTF-16 code unit, and readChar() reads 2 bytes in big-endian order (UTF-16BE). On arbitrary binary data the result is very often not representable: either it is not valid UTF-16BE, or it is half of a surrogate pair, that is, one of the two chars that together form a single Unicode code point. In your case Java shows the failed conversion as a question mark.
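To make the difference concrete, here is a minimal sketch that writes two ASCII bytes to a temporary file and reads them back both ways. The class name `ReadCharDemo`, the helper `combine`, and the temp file are illustrative assumptions, not part of the original question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadCharDemo {
    // Combines two bytes the way readChar() does: the first byte is the high byte.
    static char combine(byte hi, byte lo) {
        return (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temp file holding the ASCII bytes 'A' (0x41) and 'B' (0x42).
        Path p = Files.createTempFile("readchar", ".txt");
        Files.write(p, new byte[] { 0x41, 0x42 });
        try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "r")) {
            char c = f.readChar();                    // consumes both bytes
            System.out.printf("U+%04X%n", (int) c);   // prints U+4142, not 'A'
            f.seek(0);
            System.out.println((char) f.readByte());  // prints A
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

U+4142 is a CJK ideograph, so a console that cannot display it would typically show ? instead.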

If you know which encoding the file is in, then for a single-byte encoding it is simple:

byte b = in.readByte();
byte[] bs = new byte[] { b };
String s = new String(bs, "Cp1252"); // Some single byte encoding

For the variable-length, multi-byte UTF-8 it is also simple to identify a sequence of bytes:

  • a single byte when the high bit is 0
  • otherwise a continuation byte when the two high bits are 10
  • otherwise a starting byte (with some special cases) whose high bits tell the number of bytes in the sequence.
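The three rules above can be sketched as follows. This is only an illustration that assumes the file is well-formed UTF-8; `Utf8RandomRead`, `isContinuation`, `sequenceLength` and `readCharAt` are hypothetical names, and the special cases mentioned above (overlong or otherwise invalid lead bytes) are not checked.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class Utf8RandomRead {
    // True for bytes of the form 10xxxxxx, which never start a character.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Number of bytes in the sequence started by lead byte b, or -1 if b cannot start one.
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;                          // continuation or invalid byte
    }

    // Returns the character whose encoding covers byte offset pos.
    static String readCharAt(RandomAccessFile f, long pos) throws IOException {
        f.seek(pos);
        int b = f.readUnsignedByte();
        // Step back over continuation bytes until the starting byte is found.
        while (isContinuation(b) && pos > 0) {
            f.seek(--pos);
            b = f.readUnsignedByte();
        }
        int len = sequenceLength(b);
        if (len < 0) throw new IOException("Not valid UTF-8 at offset " + pos);
        byte[] bs = new byte[len];
        f.seek(pos);
        f.readFully(bs);
        return new String(bs, StandardCharsets.UTF_8);
    }
}
```

Called as `Utf8RandomRead.readCharAt(f, somePosition)`, it returns a String rather than a char because a 4-byte UTF-8 sequence decodes to two Java chars.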

For UTF-16LE and UTF-16BE the file position must be a multiple of 2, and each code unit read is 2 bytes long.

byte[] bs = new byte[2];
in.readFully(bs); // read(bs) may return after fewer than 2 bytes
String s = new String(bs, StandardCharsets.UTF_16LE);
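The 2-byte read above returns only half a character when the position falls inside a surrogate pair. A hedged sketch of one way to handle that for UTF-16LE follows; the class and method names are assumptions for illustration, and the file is assumed to be well-formed UTF-16LE without a byte order mark.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class Utf16RandomRead {
    // Rounds a byte offset down to the start of the 2-byte code unit that contains it.
    static long alignToCodeUnit(long pos) {
        return pos & ~1L;
    }

    // Builds one char from a little-endian byte pair (the low byte comes first in the file).
    static char decodeLE(byte first, byte second) {
        return (char) (((second & 0xFF) << 8) | (first & 0xFF));
    }

    // Reads one code unit at the current file position.
    static char readUnit(RandomAccessFile f) throws IOException {
        byte first = f.readByte();
        byte second = f.readByte();
        return decodeLE(first, second);
    }

    // Returns the whole character (one or two chars) that covers byte offset pos.
    static String readCharAt(RandomAccessFile f, long pos) throws IOException {
        pos = alignToCodeUnit(pos);
        f.seek(pos);
        char c = readUnit(f);
        if (Character.isLowSurrogate(c) && pos >= 2) {
            // Landed on the second half of a pair: step back one code unit.
            f.seek(pos - 2);
            c = readUnit(f);
        }
        if (Character.isHighSurrogate(c)) {
            char low = readUnit(f);   // the file pointer now sits on the low surrogate
            return new String(new char[] { c, low });
        }
        return String.valueOf(c);
    }
}
```

For UTF-16BE the same idea applies with the two bytes swapped in `decodeLE`.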

You almost certainly have a character encoding problem. It is not possible to simply read characters from a file. An appropriate sequence of bytes must be read, and those bytes are then interpreted according to a character encoding scheme to turn them into a character. When you read a file as text, Java must be told, perhaps implicitly, which character encoding to use.

If you tell Java the wrong encoding, you will get gibberish. If you pick an arbitrary point in a file and start reading, and that location is not the start of the encoding of a character, you will also get gibberish. One or both of those has happened in your case.
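Both failure modes can be reproduced in memory without a file. The sketch below is illustrative only; `GibberishDemo` and `decode` are made-up names, and é stands in for any non-ASCII character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GibberishDemo {
    // Decodes the bytes from index 'from' to the end using the given charset.
    static String decode(byte[] bytes, int from, Charset cs) {
        return new String(bytes, from, bytes.length - from, cs);
    }

    public static void main(String[] args) {
        // U+00E9 (é) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);

        // Right encoding, right offset: the original character.
        System.out.println(decode(utf8, 0, StandardCharsets.UTF_8));

        // Wrong encoding: each byte becomes its own character (U+00C3, U+00A9).
        System.out.println(decode(utf8, 0, StandardCharsets.ISO_8859_1));

        // Right encoding, wrong offset: a lone continuation byte decodes to U+FFFD.
        System.out.println(decode(utf8, 1, StandardCharsets.UTF_8));
    }
}
```

What the console actually shows for each line also depends on the console's own encoding, which is one more place where a ? can appear.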
