Can UTF-8 validation be done on a char[], or must it be done on the original byte[]?

Question

I am trying to validate that the files I ingest are all strictly UTF-8 compliant. From what I have read, I have concluded that to do the validation correctly, the original, untampered bytes of the data must be analyzed. If you look at the string itself after the fact, you are unlikely to find out whether any characters were non-UTF-8 compliant, because Java will already have attempted to convert them.

I am reading the files normally: I get an InputStream from the file, feed it to an InputStreamReader, then feed that to a BufferedReader. It looks something like:

InputStream is = new FileInputStream(fileLocation);
InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader br = new BufferedReader(isr);

I can subclass BufferedReader to add some validation for each character it comes across.

The issue is that BufferedReader uses a char[], not a byte[], for its buffer. That means the bytes have already been converted to chars by the time they reach it.

So, my question is: can this validation be done at the char[] level inside BufferedReader? Although I am somewhat "enforcing" UTF-8 here:

InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8);

I am seeing characters get transformed from non-UTF-8 (say, UTF-16) to UTF-8, which breaks some systems. I don't know whether the char[] is basically "too late" for this validation. Is it?

Answer 1

Define "UTF-8 compliant". There are two situations that you can reasonably call 'invalid'. UTF-8 as a format converts 32-bit numbers into byte sequences, and it can't convert just any number, only a limited set (but every number that could possibly come up in Unicode can be converted).

A valid conversion for a non-existing glyph

Not every one of the 32-bit numbers that UTF-8 can store is actually a valid Unicode code point. However, Unicode expands all the time: what isn't valid today might be valid tomorrow. There is no real way to know this unless you have the entire Unicode table loaded.
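Java's `Character.isDefined(int)` illustrates the point: it consults the Unicode tables bundled with the running JDK, so its answer for a given code point can differ between Java versions. The following is only a minimal sketch; the class and the helper `firstUndefinedCodePoint` are hypothetical names, not something from the question.

```java
class UndefinedCodePointScan {
    // Hypothetical helper: returns the first code point that the running JDK's
    // Unicode tables do not define, or -1 if every code point is defined.
    public static int firstUndefinedCodePoint(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isDefined(cp)) {
                return cp;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+10FFFF is a permanent noncharacter: UTF-8 can encode it,
        // but it never stands for a real character.
        String sample = "abc" + new String(Character.toChars(0x10FFFF));
        System.out.printf("U+%X%n", firstUndefinedCodePoint(sample));
    }
}
```

A check like this runs on already-decoded text, so it answers a different question from whether the bytes were well-formed UTF-8.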

An invalid sequence

Usually, when you convert bytes to text (char, String, Reader, Writer, StringBuilder: anything that is character-oriented) and the input contains an invalid byte sequence, you either get an exception or, if the process is in lenient mode, the failure is converted to a character that means 'this was not valid' (in Java, the replacement character U+FFFD).
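To make the "exception" branch concrete, here is a minimal sketch that decodes the raw bytes with a `CharsetDecoder` configured to report errors rather than replace them. The class and the helper `isStrictUtf8` are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Hypothetical helper: true only if the whole array is well-formed UTF-8.
    public static boolean isStrictUtf8(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data)); // decoded chars are discarded; only success matters
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] utf16WithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x68, 0x00, 0x69};
        System.out.println(isStrictUtf8(valid));        // true
        System.out.println(isStrictUtf8(utf16WithBom)); // false: 0xFE can never appear in UTF-8
    }
}
```

The same decoder can also be passed to the `InputStreamReader(InputStream, CharsetDecoder)` constructor, in which case reads through the `BufferedReader` throw a `CharacterCodingException` on bad input. The `InputStreamReader(InputStream, Charset)` constructor shown in the question replaces malformed input silently instead.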

If the exception occurred, you can't possibly have a char array (the exception was thrown instead of a char array being returned). If it didn't, you have that replacement character in your chars, so just search for it.
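The "just search for it" approach could look like the sketch below, assuming lenient decoding that substitutes U+FFFD for malformed input (as `new String(bytes, charset)` does). The class and the helper `firstReplacementIndex` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharScan {
    static final char REPLACEMENT = '\uFFFD';

    // Hypothetical helper: index of the first U+FFFD within the first `length`
    // chars of the buffer, or -1 if there is none.
    public static int firstReplacementIndex(char[] chars, int length) {
        for (int i = 0; i < length; i++) {
            if (chars[i] == REPLACEMENT) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
        char[] decoded = new String(bad, StandardCharsets.UTF_8).toCharArray();
        System.out.println(firstReplacementIndex(decoded, decoded.length)); // 1
    }
}
```

One caveat: a file that legitimately contains U+FFFD (bytes EF BF BD) is flagged as well, so at the char[] level you cannot tell a substituted character from a genuine one. Only the original bytes can distinguish the two.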
