How do I convert chunks of UTF-8 bytes to characters?

Question

I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:

for (File file: inputs) {
     byte[] b = FileUtils.readFileToByteArray(file);
     String str = new String(b, "UTF-8");
     processor.process(str);
}

My problem is that I have no guarantee that a UTF-8 character is not split between two chunks. When I run my code, some lines end with '?', which corrupts my input.

What would be a good approach to solve this?

Answer 1

If I understand correctly, you had a large text that was encoded as UTF-8 and then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across a file boundary and cause a UTF-8 decoding error.

The API is a bit dusty, but there is a SequenceInputStream that presents a series of sub-streams as what appears to be a single InputStream. Create one of these from an enumeration of FileInputStream instances, then wrap it in an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application. Because the decoder sees one continuous byte stream, a character whose bytes straddle two files is decoded correctly.
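As a rough sketch of that idea, the following might work. The class name ChunkedUtf8 and the helper decodeAll are made up for illustration, and in-memory ByteArrayInputStream chunks stand in for the real files so the example runs on its own. It also assumes the whole text fits in memory, since the processing method takes a String.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ChunkedUtf8 {

    // Hypothetical helper: treats the chunks as one continuous byte stream
    // and decodes that stream as UTF-8.
    static String decodeAll(List<InputStream> chunks) throws IOException {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        StringBuilder text = new StringBuilder();
        // Closing the reader also closes the underlying sub-streams.
        try (Reader reader = new InputStreamReader(joined, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" is 5 bytes in UTF-8; the last character takes 2 bytes.
        byte[] all = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Split in the middle of the 2-byte character to mimic a bad chunk boundary.
        List<InputStream> chunks = new ArrayList<>();
        chunks.add(new ByteArrayInputStream(Arrays.copyOfRange(all, 0, 4)));
        chunks.add(new ByteArrayInputStream(Arrays.copyOfRange(all, 4, all.length)));

        String decoded = decodeAll(chunks);
        System.out.println(decoded.equals("caf\u00e9"));
    }
}
```

With real files, each list entry would be a new FileInputStream(file), added in the original chunk order. If the text is too large to hold as one String, the Reader could instead be consumed piece by piece (for example line by line through a BufferedReader) and each piece passed to the processing method.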
