Implement a function to check if a string/byte array follows UTF-8 format

I am trying to solve this interview question.

After being given a clear definition of the UTF-8 format (e.g. 1 byte: 0b0xxxxxxx, 2 bytes: ....), I was asked to write a function that validates whether the input is valid UTF-8. The input will be a string/byte array, and the output should be yes/no.

I have two possible approaches.

First, if the input is a string, since a UTF-8 sequence is at most 4 bytes, after removing the first two characters "0b" we can use Integer.parseInt(s) to check whether the rest of the string is in the range 0 to 10FFFF. Moreover, it is better to first check that the length of the string is a multiple of 8 and that the string contains only 0s and 1s. So I will have to go through the string twice, and the complexity will be O(n).

Second, if the input is a byte array (we can also use this method if the input is a string), we check whether each 1-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check that each 8-character substring is in the range.
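For the string case, the length and digit checks can be folded into a single pass that also produces the byte array. The following is only a sketch under my own assumptions: the input looks like "0b" followed by groups of 8 binary digits, and `BitStringParser` and `parseBitString` are hypothetical names, not something from the question.

```java
final class BitStringParser {
    // Hypothetical helper: converts "0b" + groups of 8 binary digits into bytes.
    // Returns null when the text is not a well-formed bit string
    // (assumption: an empty payload is treated as malformed).
    static byte[] parseBitString(String s) {
        if (s == null || !s.startsWith("0b")) {
            return null;
        }
        String bits = s.substring(2);
        if (bits.isEmpty() || bits.length() % 8 != 0) {
            return null;
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < out.length; i++) {
            int value = 0;
            for (int j = 0; j < 8; j++) {
                char c = bits.charAt(i * 8 + j);
                if (c != '0' && c != '1') {
                    return null; // anything other than a binary digit is rejected
                }
                value = (value << 1) | (c - '0');
            }
            out[i] = (byte) value;
        }
        return out;
    }
}
```

The resulting array can then be handed to any byte-based validator, so only the byte array case needs real UTF-8 logic.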

I know there are a couple of solutions for checking a string using Java libraries, but my question is how I should implement the function based on the question.

Thanks a lot.

Let's first have a look at a visual representation of the UTF-8 design.

[Image: visual representation of the UTF-8 design]


Now let's summarize what we have to do.

  • Loop over all characters of the string (each character being a byte).
  • We need to apply a mask to each byte depending on the code point, as the x characters represent the actual code point. We use the bitwise AND operator ( & ), which copies a bit to the result only if it is set in both operands.
  • The goal of applying a mask is to remove the trailing bits so that only the prefix of the byte is compared against the pattern of a first byte. The mask has the form 0b1xxxxxxx: its leading 1 bits cover the length prefix of the byte (as given by the "Bytes in sequence" column), and the other bits are 0.
  • We can then compare the result with the first-byte pattern to verify that the byte is valid, and also determine how many bytes the sequence has.
  • If the byte matches none of the cases, it is invalid and we return "No".
  • If we get out of the loop, every character is valid, hence the string is valid.
  • Make sure that the comparison that returned true corresponds to the expected length.
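The masking step on its own can be sketched as below. This is just an illustration, assuming only lengths 1 to 4 are of interest; `LeadByte` and `sequenceLength` are hypothetical names. Note that Java bytes are signed, but the `&` promotes the byte to an int and the mask clears the sign-extended bits, so the comparisons still behave as expected.

```java
final class LeadByte {
    // Returns the sequence length announced by a lead byte (1 to 4),
    // or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte b) {
        if ((b & 0b10000000) == 0b00000000) return 1; // 0xxxxxxx
        if ((b & 0b11100000) == 0b11000000) return 2; // 110xxxxx
        if ((b & 0b11110000) == 0b11100000) return 3; // 1110xxxx
        if ((b & 0b11111000) == 0b11110000) return 4; // 11110xxx
        return -1; // continuation byte (10xxxxxx) or a pattern this sketch does not accept
    }
}
```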

The method would look like this:

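public static final boolean isUTF8(final byte[] pText) {

    // Number of bytes the current sequence should span, taken from its lead byte
    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        // Classify the lead byte by masking off its payload bits
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5; // 5- and 6-byte forms come from the original UTF-8 design
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            // A continuation byte (10xxxxxx), 0xFE or 0xFF cannot start a sequence
            return false;
        }

        // Every remaining byte of the sequence must look like 10xxxxxx
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false; // input ends in the middle of a sequence
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}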

Edit: The current method is not quite the original one; it is taken from here. The original did not work properly, as per @EJP's comment.

A small solution for real-world UTF-8 compatibility checking:

public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}
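The reason a round trip can detect bad input is that `new String(bytes, charset)` replaces malformed sequences with the replacement character U+FFFD instead of throwing, so re-encoding yields different bytes. A minimal sketch of that effect (`RoundTripDemo` and `roundTrip` are hypothetical names used only for illustration):

```java
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Decodes the bytes as UTF-8 and encodes the resulting String again.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A lone continuation byte is malformed; it decodes to U+FFFD,
        // which re-encodes as the three bytes EF BF BD.
        byte[] bad = { (byte) 0x80 };
        System.out.println(roundTrip(bad).length);

        // A valid two-byte sequence (U+00E9) survives unchanged.
        byte[] good = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(roundTrip(good).length);
    }
}
```

The price of this approach is one extra String and one extra byte array per check, which probably only matters for large inputs.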

You can check the test results:

@Test
public void testEnconding() {

    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

Test cases copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (at most 3 trailing bytes are accepted)
            return false;
        }

        if (end >= input.length) {
            // Truncated sequence: not enough trailing bytes left
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}

Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java a String is already encoded, so it will always be compatible with UTF-8. One way to check it, given a string, is:

public static boolean isUTF8(String s){
    try{
        byte[]bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

However, since all printable strings are in Unicode form, I haven't had a chance to get an error.

Second, if given a byte array, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so it will always be in a valid UTF-8 range.

My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.

Moreover, Java does provide a constructor to convert a byte array to a string, and if you are interested in the decode method, check here.

One more thing: the above answer would be correct for another language. The only improvement could be:

In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

So 4 bytes would be enough.

I am definitely new to this, so correct me if I am wrong. Thanks a lot.
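A check limited to 4 bytes could look roughly like the sketch below. It is not taken from any of the answers; the class name `StrictUtf8` is hypothetical, and the extra rejections (overlong encodings, UTF-16 surrogates, code points above U+10FFFF) are my reading of what RFC 3629 requires, so treat it as a starting point rather than a reference implementation.

```java
final class StrictUtf8 {
    static boolean isValid(byte[] in) {
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF; // mask to get the unsigned value
            if (b0 < 0x80) {       // single-byte (ASCII) sequence
                i++;
                continue;
            }
            int len; // total number of bytes in this sequence
            int cp;  // code point bits collected so far
            int min; // smallest code point that needs this many bytes
            if ((b0 & 0xE0) == 0xC0) {
                len = 2; cp = b0 & 0x1F; min = 0x80;
            } else if ((b0 & 0xF0) == 0xE0) {
                len = 3; cp = b0 & 0x0F; min = 0x800;
            } else if ((b0 & 0xF8) == 0xF0) {
                len = 4; cp = b0 & 0x07; min = 0x10000;
            } else {
                return false; // continuation byte or 5/6-byte lead
            }
            if (i + len > in.length) {
                return false; // truncated sequence
            }
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    return false; // not a 10xxxxxx continuation byte
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min) return false;                     // overlong encoding
            if (cp > 0x10FFFF) return false;                // beyond the Unicode range
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
            i += len;
        }
        return true;
    }
}
```

For example, `StrictUtf8.isValid(new byte[]{(byte) 0xC0, (byte) 0x80})` returns false because C0 80 is an overlong encoding of U+0000, even though both bytes match the lead/continuation bit patterns.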

The CharsetDecoder might be what you are looking for:

@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // noop, the test should fail here
    }

    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}

So your function might look like this:

public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);

    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}
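If only UTF-8 matters, a variant that avoids the charset lookup by name might look like this. It is a sketch using `StandardCharsets.UTF_8`; `DecoderCheck` and `isUtf8` are hypothetical names. A new decoder is created on every call because a CharsetDecoder is not safe to share between threads.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderCheck {
    static boolean isUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUtf8("hallo welt!".getBytes(StandardCharsets.UTF_8)));
        // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte
        System.out.println(isUtf8(new byte[]{(byte) 0xC3, (byte) 0x28}));
    }
}
```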
