Finding the longest UTF-8 sequence without breaking multi-byte sequences

Question

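I need to truncate a UTF-8 encoded string so that it does not exceed a predefined size in bytes. The specific protocol also requires that the truncated string still forms a valid UTF-8 encoding, i.e. no multi-byte sequence may be split.

Given the structure of the UTF-8 encoding, I could move forward, computing the encoded size of each code point, until I reach the maximum byte count. O(n) isn't very appealing, though. Is there an algorithm that completes faster, ideally in (amortized) O(1) time?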
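
For reference, the forward scan described above could look roughly like the sketch below. The names utf8_sequence_length and find_max_utf8_length_linear are made up for illustration, and the code assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: byte length of the sequence introduced by a leading byte.
// Assumes valid UTF-8; anything unexpected is treated as length 1.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Hypothetical O(n) baseline: walk forward one code point at a time and stop
// before the first code point that would exceed max_byte_count.
inline std::size_t find_max_utf8_length_linear(std::string_view sv, std::size_t max_byte_count)
{
    std::size_t pos{0};
    while (pos < sv.size())
    {
        auto const len{utf8_sequence_length(static_cast<unsigned char>(sv[pos]))};
        if (pos + len > max_byte_count || pos + len > sv.size())
        {
            break;  // the next code point does not fit
        }
        pos += len;
    }
    return pos;
}
```

The work grows with the number of code points before the cut, which is the cost I would like to avoid.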

Answer 1

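Update 2019-06-24: After a night's sleep, the problem appears much easier than my first attempt made it look. I have left the previous answer below for historical reasons.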

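The UTF-8 encoding is self-synchronizing. This makes it possible to determine whether an arbitrarily chosen code unit in a symbol stream is the start of a code sequence. A UTF-8 sequence can be split immediately to the left of the start of a code sequence.

The start of a code sequence is either an ASCII character (0xxxxxxxb) or the leading byte (11xxxxxxb) of a multi-byte sequence. Trailing bytes follow the pattern 10xxxxxxb. The start of a UTF-8 encoding satisfies the condition (code_unit & 0b11000000) != 0b10000000; in other words, it is not a trailing byte.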
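
As a minimal sketch, the condition above can be wrapped in a small predicate. The name is_sequence_start is an assumption made for this illustration; the cast to unsigned char only makes the bit test independent of whether char is signed.

```cpp
// Hypothetical helper: true if the code unit starts a code sequence, i.e. it
// is an ASCII character (0xxxxxxx) or a leading byte (11xxxxxx), and not a
// trailing byte (10xxxxxx).
constexpr bool is_sequence_start(char code_unit)
{
    return (static_cast<unsigned char>(code_unit) & 0b11000000) != 0b10000000;
}

// The Euro sign is encoded as E2 82 AC: one leading byte, two trailing bytes.
static_assert(is_sequence_start('t'));      // ASCII
static_assert(is_sequence_start('\xE2'));   // leading byte
static_assert(!is_sequence_start('\x82'));  // trailing byte
static_assert(!is_sequence_start('\xAC'));  // trailing byte
```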

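The longest UTF-8 sequence that is no longer than a requested byte count can be determined in constant time (O(1)) by applying the following algorithm:

If the input is no longer than the requested byte count, return the actual byte count.
Otherwise, loop towards the beginning (starting at the code unit one past the requested byte count) until the start of a sequence is found. Return the count of bytes to the left of that start.

In code: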

#include <cstddef>
#include <string_view>

// Precondition: sv holds a valid UTF-8 encoding.
std::size_t find_max_utf8_length(std::string_view sv, std::size_t max_byte_count)
{
    // 1. Input no longer than max byte count
    if (sv.size() <= max_byte_count)
    {
        return sv.size();
    }

    // 2. Input longer than max byte count: step back while the code unit at
    //    the cut position is a trailing byte (10xxxxxx)
    while ((sv[max_byte_count] & 0b11000000) == 0b10000000)
    {
        --max_byte_count;
    }
    return max_byte_count;
}

This test code

#include <cstddef>
#include <iostream>
#include <iomanip>
#include <string_view>
#include <string>

int main()
{
    using namespace std::literals::string_view_literals;

    std::cout << "max size output\n=== ==== ======" << std::endl;

    // C++17: u8 literals are char-based here. In C++20 they become char8_t,
    // so this line would produce a std::u8string_view instead.
    auto test{u8"€«test»"sv};
    for (std::size_t count{0}; count <= test.size(); ++count)
    {
        auto byte_count{find_max_utf8_length(test, count)};
        // data() instead of begin(): string_view iterators need not be raw pointers
        std::cout << std::setw(3) << std::setfill(' ') << count
                  << std::setw(5) << std::setfill(' ') << byte_count
                  << " " << std::string(test.data(), byte_count) << std::endl;
    }
}

produces the following output:

max size output
=== ==== ======
  0    0
  1    0
  2    0
  3    3 €
  4    3 €
  5    5 €«
  6    6 €«t
  7    7 €«te
  8    8 €«tes
  9    9 €«test
 10    9 €«test
 11   11 €«test»

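The algorithm operates on the UTF-8 encoding only. It does not attempt to process Unicode in any way. While it will always produce a valid UTF-8 encoded sequence, the encoded code points may not form a meaningful Unicode grapheme.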
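
To illustrate that caveat, here is a sketch using "é" in its decomposed form, U+0065 followed by the combining acute accent U+0301. The function is a compact copy of the one in this answer so the example stands alone; truncate_utf8 and decomposed_e_acute are names invented for the illustration.

```cpp
#include <cstddef>
#include <string_view>

// Compact copy of find_max_utf8_length from this answer (valid UTF-8 assumed).
inline std::size_t find_max_utf8_length(std::string_view sv, std::size_t max_byte_count)
{
    if (sv.size() <= max_byte_count) return sv.size();
    while ((sv[max_byte_count] & 0b11000000) == 0b10000000) --max_byte_count;
    return max_byte_count;
}

// Hypothetical helper: view of the longest valid prefix within the byte budget.
inline std::string_view truncate_utf8(std::string_view sv, std::size_t max_byte_count)
{
    return sv.substr(0, find_max_utf8_length(sv, max_byte_count));
}

// 'e' (1 byte) followed by U+0301, encoded as CC 81 (2 bytes).
inline constexpr std::string_view decomposed_e_acute{"e\xCC\x81"};
```

With a budget of 2 bytes, truncate_utf8(decomposed_e_acute, 2) yields "e": still valid UTF-8, but the accent that belonged to the same grapheme is gone. Keeping graphemes intact would need Unicode segmentation rules, which this algorithm deliberately ignores.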

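The algorithm completes in constant time. Regardless of input size, the final loop will spin at most 3 times, given the current limit of at most 4 bytes per UTF-8 encoding. If the UTF-8 encoding were ever changed to allow up to 5 or 6 bytes per encoded code point, the algorithm would continue to work and still complete in constant time.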
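
The bound of 3 steps can also be made explicit in the code. The following is a defensive sketch, not part of the answer's original function: find_max_utf8_length_bounded is a hypothetical name, and the hard-coded limit assumes today's maximum of 3 trailing bytes per sequence. For valid UTF-8 it returns the same result as the loop above; for malformed input it merely guarantees an in-range result and no index underflow.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical bounded variant: never steps back more than 3 code units and
// never moves past the beginning of the input.
inline std::size_t find_max_utf8_length_bounded(std::string_view sv, std::size_t max_byte_count)
{
    if (sv.size() <= max_byte_count)
    {
        return sv.size();
    }

    std::size_t pos{max_byte_count};
    for (int steps{0}; steps < 3 && pos > 0; ++steps)
    {
        if ((static_cast<unsigned char>(sv[pos]) & 0b11000000) != 0b10000000)
        {
            return pos;  // found the start of a sequence
        }
        --pos;  // trailing byte: keep looking to the left
    }
    // Either 3 trailing bytes were skipped (pos is then the leading byte of a
    // 4-byte sequence in valid input) or the beginning was reached.
    return pos;
}
```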

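Previous answer

This can be done in O(1) by decomposing the problem into the following cases:

The input is no longer than the requested byte count. In this case, simply return the input.
The input is longer than the requested byte count. Find out the relative position within an encoding at index max_byte_count - 1:
1. If this is an ASCII character (highest bit not set, 0xxxxxxxb), we are at a natural boundary and can cut the string right after it.
2. Otherwise, we are at the start, in the middle, or at the tail of a multi-byte sequence. To find out which, consider the following character. If it is an ASCII character (0xxxxxxxb) or the start of a multi-byte sequence (11xxxxxxb), we are at the tail of a multi-byte sequence, i.e. at a natural boundary.
3. Otherwise, we are at the start or in the middle of a multi-byte sequence. Iterate towards the beginning of the string until we find the start of a multi-byte encoding (11xxxxxxb). Cut the string before that character.

The following code calculates the length of the truncated string, given a maximum byte count. The input needs to form a valid UTF-8 encoding.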

#include <cstddef>
#include <string_view>

std::size_t find_max_utf8_length(std::string_view sv, std::size_t max_byte_count)
{
    // 1. No longer than max byte count
    if (sv.size() <= max_byte_count)
    {
        return sv.size();
    }

    // Nothing fits into zero bytes; also keeps max_byte_count - 1 below in range
    if (max_byte_count == 0)
    {
        return 0;
    }

    // 2. Longer than byte count
    auto c0{static_cast<unsigned char>(sv[max_byte_count - 1])};
    if ((c0 & 0b10000000) == 0)
    {
        // 2.1 ASCII
        return max_byte_count;
    }

    auto c1{static_cast<unsigned char>(sv[max_byte_count])};
    if (((c1 & 0b10000000) == 0) || ((c1 & 0b11000000) == 0b11000000))
    {
        // 2.2. At end of multi-byte sequence
        return max_byte_count;
    }

    // 2.3. At start or middle of multi-byte sequence
    unsigned char c{};
    do
    {
        --max_byte_count;
        c = static_cast<unsigned char>(sv[max_byte_count]);
    } while ((c & 0b11000000) != 0b11000000);
    return max_byte_count;
}

The following test code

#include <cstddef>
#include <iostream>
#include <iomanip>
#include <string_view>
#include <string>

int main()
{
    using namespace std::literals::string_view_literals;

    std::cout << "max size output\n=== ==== ======" << std::endl;

    // C++17: u8 literals are char-based here. In C++20 they become char8_t,
    // so this line would produce a std::u8string_view instead.
    auto test{u8"€«test»"sv};
    for (std::size_t count{0}; count <= test.size(); ++count)
    {
        auto byte_count{find_max_utf8_length(test, count)};
        // data() instead of begin(): string_view iterators need not be raw pointers
        std::cout << std::setw(3) << std::setfill(' ') << count
                  << std::setw(5) << std::setfill(' ') << byte_count
                  << " " << std::string(test.data(), byte_count) << std::endl;
    }
}

produces this output:

max size output
=== ==== ======
  0    0
  1    0
  2    0
  3    3 €
  4    3 €
  5    5 €«
  6    6 €«t
  7    7 €«te
  8    8 €«tes
  9    9 €«test
 10    9 €«test
 11   11 €«test»

Solution 1
7, accepted, 2019-06-23 13:26:19
