How to convert chunks of UTF-8 bytes to characters?
I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:
    for (File file : inputs) {
        byte[] b = FileUtils.readFileToByteArray(file);
        String str = new String(b, "UTF-8");
        processor.process(str);
    }
My problem is that I have no guarantee that a UTF-8 character is not split between two chunks. When I run my code, some lines end with '?', which corrupts my input.
What would be a good approach to solve this?
If I understand correctly, you had a large text that was encoded as UTF-8 and then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across a file boundary and cause a UTF-8 decoding error.
The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application. Because the decoder then sees one continuous byte stream, a character whose bytes straddle two files is decoded correctly.
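A minimal sketch of that idea, not a definitive implementation. The class name `ChunkedUtf8` and the helpers `openReader` and `readAll` are made up for illustration, and the demo uses in-memory `ByteArrayInputStream` chunks in place of files so that it runs on its own. With real files you would pass `FileInputStream` instances instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkedUtf8 {

    // Hypothetical helper: joins the chunk streams into one logical stream
    // and wraps it in a reader that decodes UTF-8.
    public static Reader openReader(List<InputStream> chunks) {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        return new InputStreamReader(joined, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: drains a reader into a String (demo only;
    // for large inputs you would process the reader incrementally).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "h\u00e9llo" encodes to 68 C3 A9 6C 6C 6F; U+00E9 takes two bytes.
        byte[] all = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Split between C3 and A9 to simulate a character cut by a chunk boundary.
        List<InputStream> chunks = new ArrayList<>();
        chunks.add(new ByteArrayInputStream(all, 0, 2));
        chunks.add(new ByteArrayInputStream(all, 2, all.length - 2));

        try (Reader reader = openReader(chunks)) {
            String text = readAll(reader);
            System.out.println(text.equals("h\u00e9llo")); // whether the split character survived
        }
    }
}
```

Closing the reader closes the underlying SequenceInputStream, which in turn closes the remaining sub-streams. If there are many chunk files, opening every FileInputStream up front may exhaust file handles; a custom Enumeration that opens each file only when it is requested is one way around that.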