
Java: read bytes from a utf-8 file

I have a file which contains UTF-8 data. This file has no BOM (byte order mark) and no length/size prefix for each Unicode word/line.

I want to read bytes (yes bytes!), from a given offset and length. If the API has functions like seek, read bytes, or read bytes from an offset, it would be really helpful.

Example content: "100° Info". The length of this content is 9. If I request 9 bytes, it should read everything, but currently it reads only 8. It looks like the API is treating the Unicode character as 2 chars.

How do I read the content correctly? Which API should I use for this?

But the Unicode character for degrees (U+00B0) actually is two bytes when encoded as UTF-8: the degree symbol is represented by the bytes c2 b0. You can use RandomAccessFile in Java if you really want to read bytes at specific offsets in a file, but I doubt that's what you really want.
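If reading raw bytes at an offset really is the goal, a minimal sketch with RandomAccessFile could look like the following. The class name ByteRangeReader and the helper readBytes are made up for illustration, and the temporary file only stands in for the asker's real file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteRangeReader {
    // Hypothetical helper: returns exactly `length` bytes starting at byte `offset`.
    static byte[] readBytes(String filename, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(filename, "r")) {
            byte[] buffer = new byte[length];
            file.seek(offset);      // position is counted in bytes, not characters
            file.readFully(buffer); // throws EOFException if fewer bytes remain
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temporary file holding the example text ("\u00B0" is the degree sign).
        Path path = Files.createTempFile("degrees", ".txt");
        Files.write(path, "100\u00B0 Info".getBytes(StandardCharsets.UTF_8));

        // The whole text needs 10 bytes, because the degree sign takes two.
        byte[] all = readBytes(path.toString(), 0, 10);
        System.out.println(new String(all, StandardCharsets.UTF_8));

        Files.delete(path);
    }
}
```

Note that offsets chosen this way can land in the middle of a multi-byte character, in which case the bytes will not decode cleanly on their own.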

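Probably the easiest way to do what it seems you want is to use a Reader that decodes UTF-8 (an InputStreamReader with an explicit charset, rather than a plain FileReader, which without an explicit charset falls back to the default encoding) and either read into an array of char of size 9, or read just 9 characters into a larger char array. For example: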

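// Requires java.io.* and java.nio.charset.StandardCharsets.
try (Reader reader = new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8)) {
    char[] buffer = new char[1024];
    // read() returns the number of chars actually read, which may be fewer than 9
    int charsRead = reader.read(buffer, 0, 9);
}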
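One caveat with reading through a Reader: a single read call is allowed to return fewer characters than requested. A sketch that loops until the requested count is reached (or the stream ends) might look like this. The class name ExactCharReader and the helper readChars are hypothetical, and an in-memory byte array stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ExactCharReader {
    // Hypothetical helper: reads up to `count` chars, looping because one read() may return fewer.
    static String readChars(Reader reader, int count) throws IOException {
        char[] buffer = new char[count];
        int filled = 0;
        while (filled < count) {
            int n = reader.read(buffer, filled, count - filled);
            if (n == -1) {
                break; // end of stream reached before `count` chars were available
            }
            filled += n;
        }
        return new String(buffer, 0, filled);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: these bytes stand in for the file content ("\u00B0" is the degree sign).
        byte[] fileContent = "100\u00B0 Info".getBytes(StandardCharsets.UTF_8);
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileContent), StandardCharsets.UTF_8)) {
            String text = readChars(reader, 9);
            System.out.println(text.length()); // prints 9: nine chars decoded from ten bytes
        }
    }
}
```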

I have a feeling you are confusing characters and bytes. The text 100° Info has nine characters, but it occupies ten bytes in UTF-8 because the degree symbol is stored as two bytes. If you read nine bytes you would miss the final o from Info, but the result would still decode as a valid string, since the cut falls between two single-byte characters.
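The difference between the two counts can be checked directly. The following is a small sketch (the class name CharsVersusBytes and the helper utf8Length are made up for illustration) that also shows what decoding only the first nine bytes yields:

```java
import java.nio.charset.StandardCharsets;

public class CharsVersusBytes {
    // Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "100\u00B0 Info"; // "\u00B0" is the degree sign
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.length());    // prints 9 (characters)
        System.out.println(utf8Length(text)); // prints 10 (bytes)

        // Decoding only the first nine bytes drops the final 'o'.
        String firstNine = new String(bytes, 0, 9, StandardCharsets.UTF_8);
        System.out.println(firstNine.length()); // prints 8
    }
}
```

The last line matches the symptom in the question: nine bytes of this text decode to only eight characters.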

You can of course read the content into a string and then use String.getBytes("UTF-8") to get the bytes for that string. In your outlined case this would return 10 bytes, not 9, because of the two-byte degree symbol.
