
How to convert chunks of UTF-8 bytes to characters?

I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:

for (File file: inputs) {
     byte[] b = FileUtils.readFileToByteArray(file);
     String str = new String(b, "UTF-8");
     processor.process(str);
}

My problem is that I have no guarantee that a UTF-8 character is not split between two chunks. When I run my code, some lines end with '?', which corrupts my input.

What would be a good approach to solve this?

If I understand correctly, you had a large text that was encoded as UTF-8 and then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across a file boundary and cause a UTF-8 decoding error.

The API is a bit dusty, but there is a SequenceInputStream that creates what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then wrap it in an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application.
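A minimal sketch of that approach, under a few assumptions: the class name ChunkedUtf8 and the helper decodeChunks are made up for illustration, and the chunks are simulated with in-memory ByteArrayInputStream instances so the example runs on its own. With real files you would pass one FileInputStream per chunk, in the original order.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ChunkedUtf8 {

    // Hypothetical helper: joins the chunk streams into one logical stream
    // and decodes it as UTF-8, so the decoder sees bytes across chunk boundaries.
    public static String decodeChunks(List<? extends InputStream> chunks) throws IOException {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(joined, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "h\u00e9llo" encodes to 6 bytes; the second character takes 2 bytes.
        String original = "h\u00e9llo";
        byte[] all = original.getBytes(StandardCharsets.UTF_8);

        // Split in the middle of the 2-byte character to simulate a bad chunk boundary.
        List<InputStream> chunks = Arrays.asList(
                new ByteArrayInputStream(Arrays.copyOfRange(all, 0, 2)),
                new ByteArrayInputStream(Arrays.copyOfRange(all, 2, all.length)));

        String decoded = decodeChunks(chunks);
        System.out.println(decoded.equals(original));
    }
}
```

This sketch collects the whole text into one String because the processor in the question accepts a String. If the input is too large for that, the same Reader could instead be consumed piece by piece (for example line by line through a BufferedReader), since the split-character problem is already handled at the byte-to-char decoding step.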


 