Efficient truncating string copy `str` to `[u8]` (utf8 aware strlcpy)?

Question

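While Rust provides str.as_bytes, I want to copy a string into a fixed-size buffer so that only complete Unicode scalar values are copied, and the result is truncated with a null terminator written at the end. In C terms, I'd call this a utf8 aware strlcpy (that is, it copies into a fixed-size buffer and ensures the result is null terminated).

This is the function I came up with, but I expect there is a better way to do it in Rust: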

// returns the number of bytes written to `utf8_dst`, including the null terminator
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    let mut index: usize = 0;
    if utf8_dst_len > 1 {
        let mut utf8_buf: [u8; 4] = [0; 4];
        for c in str_src.chars() {
            let len_utf8 = c.len_utf8();
            let index_next = index + len_utf8;
            c.encode_utf8(&mut utf8_buf);
            if index_next >= utf8_dst_len {
                break;
            }
            utf8_dst[index..index_next].clone_from_slice(&utf8_buf[0..len_utf8]);
            index = index_next;
        }
    }
    utf8_dst[index] = 0;
    return index + 1;
}

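Note: I realize this isn't ideal, since multiple Unicode scalar values may make up a single glyph, but the result can at least be decoded back into a str.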
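To make the caveat in the note concrete, here is a small sketch using only the standard library. It shows one visible glyph made of two scalar values, and that cutting between them loses the accent but still leaves valid UTF-8. The helper name `first_scalar` is hypothetical and is not part of the code in the question.

```rust
/// Hypothetical helper: the prefix of `s` holding only its first scalar value.
fn first_scalar(s: &str) -> &str {
    match s.chars().next() {
        Some(c) => &s[..c.len_utf8()],
        None => "",
    }
}

fn main() {
    // 'e' followed by U+0301 (combining acute accent):
    // rendered as one glyph, but stored as two scalar values.
    let s = "e\u{301}";
    assert_eq!(s.chars().count(), 2);
    assert_eq!(s.len(), 3); // 1 byte for 'e', 2 bytes for U+0301

    // Truncating after the first scalar value drops the accent...
    let prefix = first_scalar(s);
    assert_eq!(prefix, "e");

    // ...but the remaining bytes still decode as UTF-8.
    assert!(std::str::from_utf8(prefix.as_bytes()).is_ok());
}
```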

Answer 1

When you need to know the actual character boundaries, Rust's str has a handy method, char_indices. This immediately simplifies your function:

pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
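    // Byte offset just past the last character that fits,
    // leaving one byte for the null terminator.
    let mut last_index = 0;
    for (idx, c) in str_src.char_indices() {
        let end = idx + c.len_utf8();
        if end >= utf8_dst_len {
            break;
        }
        last_index = end;
    }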
    utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}

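For reference, a minimal sketch of what char_indices yields on a string with mixed-width characters. The helper `char_starts` is a hypothetical name used only for this illustration. Note that the offsets are start positions; the end of a character is its start plus `len_utf8()`.

```rust
/// Hypothetical helper: byte offsets at which each scalar value of `s` starts.
fn char_starts(s: &str) -> Vec<usize> {
    s.char_indices().map(|(idx, _)| idx).collect()
}

fn main() {
    // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes, 'b' is 1 byte.
    let s = "a\u{e9}\u{20ac}b";
    assert_eq!(char_starts(s), vec![0, 1, 3, 6]);

    // The last start offset is not the end of the string.
    assert_eq!(s.len(), 7);
}
```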

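However, apart from the copy itself, you don't actually need to walk over every character, because boundaries are easy to find in UTF-8; Rust has str::is_char_boundary(). This lets you look backwards from the end instead: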

use std::cmp::min;

pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    let mut last_index = min(utf8_dst_len-1, str_src.len());
    while last_index > 0 && !str_src.is_char_boundary(last_index) {
        last_index -= 1;
    }
    utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}

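A small sketch of how is_char_boundary behaves, using only the standard library. The helper `floor_boundary` is a hypothetical name for the backwards scan described above. Since a UTF-8 scalar value is at most 4 bytes long, the loop steps back at most 3 times.

```rust
/// Hypothetical helper: largest index <= `max` that is a char boundary in `s`.
fn floor_boundary(s: &str, max: usize) -> usize {
    let mut idx = max.min(s.len());
    // is_char_boundary(0) is always true, so this cannot underflow.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

fn main() {
    // Bytes: 'a' at 0, U+20AC at 1..4, 'b' at 4.
    let s = "a\u{20ac}b";
    assert!(s.is_char_boundary(0));
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2)); // inside U+20AC
    assert!(!s.is_char_boundary(3)); // inside U+20AC
    assert!(s.is_char_boundary(4));
    assert!(s.is_char_boundary(s.len())); // the end of the string counts

    // Scanning back from index 3 lands on the start of U+20AC.
    assert_eq!(floor_boundary(s, 3), 1);
}
```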

Answer 2

Based on Chris Emerson's answer and @Matthieu-m's suggestion to remove a redundant check.

// returns the number of bytes written to `utf8_dst`, including the null terminator
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    // truncate if 'str_src' is too long
    let mut last_index = str_src.len();
    if last_index >= utf8_dst_len {
        last_index = utf8_dst_len - 1;
        // no need to check last_index > 0 here,
        // is_char_boundary covers that case
        while !str_src.is_char_boundary(last_index) {
            last_index -= 1;
        }
    }
    utf8_dst[0..last_index].clone_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}

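@ChrisEmerson: I'm posting this because it's the code I'm going to use in my project. Feel free to update your answer with it if you like, and I'll delete this one.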
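A usage sketch for the function in this answer, with the function repeated so the block compiles on its own. The buffer sizes and input strings are only illustrative.

```rust
// Copy of the function from this answer, repeated so the example is self-contained.
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    let mut last_index = str_src.len();
    if last_index >= utf8_dst_len {
        last_index = utf8_dst_len - 1;
        while !str_src.is_char_boundary(last_index) {
            last_index -= 1;
        }
    }
    utf8_dst[0..last_index].clone_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}

fn main() {
    // The whole string fits: 5 bytes of text plus the null terminator.
    let mut buf = [0xffu8; 8];
    let n = strlcpy_utf8(&mut buf, "hello");
    assert_eq!(n, 6);
    assert_eq!(&buf[..n], &b"hello\0"[..]);

    // A 3-byte buffer leaves 2 bytes for text. Index 2 falls inside
    // U+00E9 (2 bytes), so the copy backs off and keeps only "h".
    let mut small = [0xffu8; 3];
    let n = strlcpy_utf8(&mut small, "h\u{e9}llo");
    assert_eq!(n, 2);
    assert_eq!(&small[..n], &b"h\0"[..]);
    assert_eq!(std::str::from_utf8(&small[..n - 1]), Ok("h"));

    // A 1-byte buffer only has room for the terminator.
    let mut tiny = [0xffu8; 1];
    assert_eq!(strlcpy_utf8(&mut tiny, "abc"), 1);
    assert_eq!(tiny[0], 0);

    // An empty buffer is left untouched and 0 is returned.
    let mut empty: [u8; 0] = [];
    assert_eq!(strlcpy_utf8(&mut empty, "abc"), 0);
}
```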


Answer 1: score 4, 2017-02-06 08:47:07
Answer 2: score -1, 2017-02-06 10:23:42
