Implement a function to check whether a string/byte array follows the UTF-8 format

Question

I am trying to solve this interview question.

After being given a clear definition of the UTF-8 format (e.g. 1 byte: 0b0xxxxxxx, 2 bytes: ....), I was asked to write a function that validates whether the input is valid UTF-8. The input will be a string/byte array, and the output should be yes/no.

I have two possible approaches.

First, if the input is a string, since a UTF-8 sequence is at most 4 bytes, after removing the first two characters "0b" we can use Integer.parseInt(s) to check whether the rest of the string is in the range 0 to 0x10FFFF. Moreover, it is better to first check that the length of the string is a multiple of 8 and that the input contains only 0s and 1s. So I will have to go through the string twice, and the complexity will be O(n).

Second, if the input is a byte array (we can also use this method if the input is a string), we check whether each 1-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check that each 8-character substring is in range.
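For the string form, the length check, the digit check, and the conversion can be done in one pass. The following is only a minimal sketch under assumptions that are not in the question: the input consists of '0'/'1' characters with an optional "0b" prefix, an empty input is treated as an error, and the names BinaryStringParser and parseBits are hypothetical. The resulting byte array can then be handed to any byte-level validator.

```java
final class BinaryStringParser {
    /**
     * Converts text such as "0b1100001110101001" into bytes.
     * Returns null when the text is not a whole number of 8-bit groups of '0'/'1'.
     * (Treating the empty string as an error is an assumption of this sketch.)
     */
    static byte[] parseBits(String text) {
        String bits = text.startsWith("0b") ? text.substring(2) : text;
        if (bits.isEmpty() || bits.length() % 8 != 0) {
            return null;
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c != '0' && c != '1') {
                return null;
            }
            // Shift the bits gathered so far to the left and append the new bit.
            out[i / 8] = (byte) ((out[i / 8] << 1) | (c - '0'));
        }
        return out;
    }
}
```

Parsing each 8-character group separately also avoids the overflow that a single Integer.parseInt call would hit on inputs longer than 31 bits.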

I know there are a couple of solutions for checking a string using Java libraries, but my question is how I should implement the function based on the question.

Thanks a lot.

Answer 1

Let's first have a look at a visual representation of the UTF-8 design.

[image]

Now let's summarize what we have to do.

Loop over all characters of the string (each character being a byte).
We need to apply a mask to each byte depending on the sequence length, because the x bits carry the actual code point. We use the bitwise AND operator (&), which copies a bit to the result only if it is set in both operands.
The goal of applying the mask is to clear the trailing payload bits so that only the length prefix of the leading byte is compared. The masked byte is compared against a pattern in which 1 appears "Bytes in sequence" times and all other bits are 0 (for example, mask 0b11100000 against pattern 0b11000000 for a 2-byte sequence).
We can then compare with the first byte to verify that it is valid, and also determine how many bytes the sequence contains.
If the byte matches none of the cases, it is invalid and we return "No".
If we get out of the loop, every character is valid, hence the string is valid.
Make sure the comparison that returned true corresponds to the expected length, i.e. that the right number of continuation bytes follows.

The method would look like this:

public static final boolean isUTF8(final byte[] pText) {

    // Total number of bytes announced by the current leading byte
    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        // Classify the leading byte by its high-order bits
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5; // 5- and 6-byte forms belong to the original UTF-8 design
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            // Stray continuation byte (0b10xxxxxx) or 0b1111111x
            return false;
        }

        // Every remaining byte of the sequence must look like 0b10xxxxxx
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false; // input ends in the middle of a sequence
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}

Edit: The current method is not the original one (almost, but not quite) and was taken from here. The original one did not work properly, as per @EJP's comment.

Answer 2

A small solution for real-world UTF-8 compatibility checking:

public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}

You can check the test results:

@Test
public void testEncoding() {

    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

Test cases copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
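The round trip works because the String(byte[], Charset) constructor does not throw on malformed input; it substitutes the replacement character U+FFFD, which re-encodes to the three bytes EF BF BD and therefore no longer matches the original array. A small sketch of that behaviour follows; the class name RoundTripDemo is hypothetical and the method is a condensed copy of the one above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    /** Same idea as the answer: decode, re-encode, compare. */
    static boolean isUTF8(byte[] inputBytes) {
        String converted = new String(inputBytes, StandardCharsets.UTF_8);
        return Arrays.equals(inputBytes, converted.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] malformed = {(byte) 0xFF}; // 0xFF never occurs in UTF-8
        String decoded = new String(malformed, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true: replaced, not rejected
        System.out.println(isUTF8(malformed));        // false: EF BF BD != FF
    }
}
```

The cost of this approach is a full decode plus a full encode and two temporary copies of the data, which may matter for large inputs.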

Answer 3

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading byte order mark (EF BB BF), if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte; end is the index of the last continuation byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Stray continuation byte, or a leading byte announcing more than 4 bytes
            return false;
        }

        if (end >= j) {
            // The sequence is truncated
            return false;
        }

        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}

Answer 4

Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java a String is already encoded, so it will always be compatible with UTF-8. One way to check, given a string, is:

public static boolean isUTF8(String s) {
    try {
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

However, since all printable strings are in Unicode form, I never got a chance to trigger an error.
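One case where a Java String cannot be represented in UTF-8 is an unpaired surrogate. As far as I know, getBytes does not report this; it silently substitutes a replacement byte, which would explain why no error ever shows up. A hedged sketch of a check that does report it, using CharsetEncoder.canEncode (the class and method names here are hypothetical):

```java
import java.nio.charset.StandardCharsets;

final class StringEncodability {
    /** Returns true when every char sequence in s can be encoded as UTF-8. */
    static boolean canEncodeAsUtf8(String s) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(canEncodeAsUtf8("hello"));        // ordinary text
        System.out.println(canEncodeAsUtf8("\uD83D\uDE00")); // a proper surrogate pair
        System.out.println(canEncodeAsUtf8("\uD800"));       // a lone high surrogate
    }
}
```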

Second, if given a byte array, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so it will always be in a valid UTF-8 range.

My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.

Moreover, Java does provide a constructor to convert a byte array to a string, and if you are interested in the decode method, check here.

One more thing: the above answer would be correct for another language. The only improvement could be:

In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

So 4 bytes would be enough.
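The RFC 3629 restriction quoted above could be sketched as a stricter validator that stops at 4 bytes and also decodes the code point, so that overlong forms, UTF-16 surrogates, and values past U+10FFFF are rejected. This is an illustrative sketch, not the method from any of the answers; the names StrictUtf8 and isValid are hypothetical.

```java
final class StrictUtf8 {
    /** Returns true when bytes are well-formed UTF-8 as restricted by RFC 3629. */
    static boolean isValid(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF; // treat the byte as unsigned
            int length;
            int codePoint;
            if (b < 0x80) {
                i++;
                continue; // single-byte (ASCII) sequence
            } else if ((b & 0xE0) == 0xC0) {
                length = 2;
                codePoint = b & 0x1F;
            } else if ((b & 0xF0) == 0xE0) {
                length = 3;
                codePoint = b & 0x0F;
            } else if ((b & 0xF8) == 0xF0) {
                length = 4;
                codePoint = b & 0x07;
            } else {
                return false; // stray continuation byte or 5/6-byte leading byte
            }
            if (i + length > bytes.length) {
                return false; // truncated sequence
            }
            for (int k = 1; k < length; k++) {
                int c = bytes[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    return false; // not a continuation byte
                }
                codePoint = (codePoint << 6) | (c & 0x3F);
            }
            // Smallest code point that genuinely needs this many bytes.
            int minimum = length == 2 ? 0x80 : length == 3 ? 0x800 : 0x10000;
            if (codePoint < minimum || codePoint > 0x10FFFF
                    || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
                return false; // overlong, out of range, or a surrogate
            }
            i += length;
        }
        return true;
    }
}
```

With this check, inputs such as C0 80 (an overlong encoding of U+0000) or F4 90 80 80 (U+110000) are rejected even though every byte has a plausible prefix.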

I am definitely new to this, so correct me if I am wrong. Thanks a lot.

Answer 5

The CharsetDecoder might be what you are looking for:

@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // expected: decoding random bytes should fail here
    }

    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}

So your function might look like this:

public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);

    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}
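A short usage sketch of the function above, wrapped in a hypothetical CheckEncodingDemo class so that it compiles on its own. The sample byte arrays are my own illustrations, not taken from the answer. Note that Charset.forName throws an unchecked exception for an unknown charset name, so callers that pass user-supplied names may want to handle that separately.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CheckEncodingDemo {
    /** Copy of the function above: true when bytes decode cleanly in the given charset. */
    static boolean checkEncoding(final byte[] bytes, final String encoding) {
        final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (final CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "hallo welt!".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};  // overlong form of U+0000
        byte[] truncated = {(byte) 0xE2, (byte) 0x82}; // 3-byte sequence cut short
        System.out.println(checkEncoding(valid, "UTF-8"));     // true
        System.out.println(checkEncoding(overlong, "UTF-8"));  // false
        System.out.println(checkEncoding(truncated, "UTF-8")); // false
    }
}
```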

5 answers

Answer 1
13 2015-03-06 04:25:02

Answer 2
3 2018-05-09 05:30:18

Answer 3
1 2017-05-26 05:44:12

Answer 4
0 accepted 2015-03-08 03:39:57

Answer 5
-1 2017-01-01 17:01:50
