I have a file which contains UTF-8 data. This file does not have any BOM (Byte order mark) nor any length/size information as prefix for each unicode word/line.
I want to read bytes (yes bytes!), from a given offset and length. If the API has functions like seek, read bytes, or read bytes from an offset, it would be really helpful.
Example content: "100° Info". The length of this content is 9. If I request 9 bytes, it should read everything, but currently it reads only 8. It looks like the API is treating the Unicode character as 2 chars.
How can I read the content correctly? Which API should I use?
But the Unicode character for degrees actually is two bytes when encoded as UTF-8. A degree symbol is represented by the bytes c2 b0. You can use RandomAccessFile in Java if you really want to read bytes at specific offsets in a file, but I doubt that's what you really want.
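If byte offsets really are what is needed, a minimal sketch with RandomAccessFile could look like the following. The class name ByteRangeReader and the helper readBytes are hypothetical names made up for this illustration, and the sample file is a temporary file created only for the demo.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteRangeReader {

    // Hypothetical helper: returns exactly `length` bytes starting at byte `offset`.
    public static byte[] readBytes(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);              // the position is counted in bytes, not characters
            byte[] buffer = new byte[length];
            file.readFully(buffer);         // throws EOFException if the file is too short
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write the sample text as UTF-8 (\u00B0 is the degree sign).
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "100\u00B0 Info".getBytes(StandardCharsets.UTF_8));

        // The text is 9 characters but 10 bytes, so 10 bytes must be requested.
        byte[] all = readBytes(sample.toString(), 0, 10);
        System.out.println(new String(all, StandardCharsets.UTF_8));

        Files.delete(sample);
    }
}
```

Note that an arbitrary byte offset can land in the middle of a multi-byte character, in which case decoding that slice on its own will not give the expected text.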
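Probably the easiest way to do what it seems you want is to use an InputStreamReader with an explicit UTF-8 charset (a FileReader falls back to the platform default encoding unless you give it a charset, which its constructors only accept since Java 11) and either read into an array of char of size 9, or read just 9 characters into a larger char array. For example: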
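import java.io.*;
import java.nio.charset.StandardCharsets;

try (Reader reader = new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8)) {
    char[] buffer = new char[1024];
    // read() may return fewer than 9 chars, or -1 at end of file
    int charsRead = reader.read(buffer, 0, 9);
}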
I have a feeling you are confusing characters and bytes. The text 100° Info has nine characters, but that is ten bytes in UTF-8 because the degree symbol is stored as two bytes. If you read nine bytes you would miss the o from Info, but the result would still decode as a valid string, since the missing o is a single-byte character and no multi-byte sequence gets cut in half.
You can of course read the content into a string and then use String.getBytes(StandardCharsets.UTF_8) to get the bytes for a given string. This would return all 10 bytes in your outlined case.
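To make the character/byte difference concrete, here is a small sketch. The class CharsVersusBytes and its two helper methods are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class CharsVersusBytes {

    // Number of bytes the text occupies once encoded as UTF-8.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encodes the text as UTF-8, then decodes only the first `byteCount` bytes.
    public static String decodeFirstBytes(String text, int byteCount) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, byteCount, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "100\u00B0 Info";              // \u00B0 is the degree sign

        System.out.println(text.length());           // 9 characters
        System.out.println(utf8Length(text));        // 10 bytes

        // Nine bytes stop just before the final 'o', so the result has 8 characters.
        System.out.println(decodeFirstBytes(text, 9).length());   // 8
    }
}
```

This matches what the question describes: asking for 9 bytes yields only 8 characters, because the degree sign alone takes up two of those bytes.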