Efficient truncating string copy `str` to `[u8]` (utf8 aware strlcpy)?

While Rust provides `str.as_bytes`, I'm looking to copy a string into a fixed-size buffer so that only complete Unicode scalar values are copied. If the string does not fit, it is truncated, and a null terminator is always written at the end. In C terms, I'd call this a UTF-8 aware `strlcpy` (that is, it copies into a fixed-size buffer and ensures the result is null-terminated).


This is the function I came up with, but I expect there are better ways to do this in Rust:

// Returns the number of bytes written to `utf8_dst`, including the null terminator.
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    let mut index: usize = 0;
    if utf8_dst_len > 1 {
        let mut utf8_buf: [u8; 4] = [0; 4];
        for c in str_src.chars() {
            let len_utf8 = c.len_utf8();
            let index_next = index + len_utf8;
            c.encode_utf8(&mut utf8_buf);
            if index_next >= utf8_dst_len {
                break;
            }
            utf8_dst[index..index_next].clone_from_slice(&utf8_buf[0..len_utf8]);
            index = index_next;
        }
    }
    utf8_dst[index] = 0;
    return index + 1;
}

Note: I realize this isn't ideal, since multiple Unicode scalar values may make up a single glyph; however, the result can at least be decoded back into a `str`.
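To make the distinction in that note concrete, here is a minimal sketch. `cut_is_valid_utf8` is a hypothetical helper (not from the post) that checks whether the first `n` bytes of a string are still valid UTF-8.

```rust
/// Hypothetical helper, for illustration only: reports whether the first
/// `n` bytes of `s` (clamped to the length of `s`) form valid UTF-8.
pub fn cut_is_valid_utf8(s: &str, n: usize) -> bool {
    let n = n.min(s.len());
    std::str::from_utf8(&s.as_bytes()[..n]).is_ok()
}
```

For `"e\u{301}"` (the letter e followed by a combining acute accent), a cut after byte 1 still decodes but drops the accent, so the glyph changes. A cut after byte 2 lands inside the accent's two-byte encoding and does not decode at all.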

Rust's `str` has a handy method, `char_indices`, for when you need to know the actual character boundaries. This would immediately simplify your function somewhat:

pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
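    // byte offset just past the last character that fits,
    // keeping one byte free for the null terminator
    let mut last_index = 0;
    for (idx, c) in str_src.char_indices() {
        let end = idx + c.len_utf8();
        if end >= utf8_dst_len {
            break;
        }
        last_index = end;
    }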
    utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}
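For reference, a small sketch of what `char_indices` reports. `char_spans` is a hypothetical helper name; it pairs each character's starting byte offset (as yielded by `char_indices`) with its end offset, computed with `char::len_utf8`.

```rust
/// Hypothetical helper, for illustration only: returns the (start, end)
/// byte offsets of every character in `s`.
pub fn char_spans(s: &str) -> Vec<(usize, usize)> {
    s.char_indices()
        .map(|(start, c)| (start, start + c.len_utf8()))
        .collect()
}
```

For `"aé€"` the spans are (0, 1), (1, 3) and (3, 6). A character fits in the buffer only if its end offset still leaves one byte for the terminator.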


However, you don't actually need to iterate through every character except when copying: it's easy to find a boundary in UTF-8, and Rust has `str::is_char_boundary()`. This lets you instead look backwards from the end:

use std::cmp::min;

pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    let mut last_index = min(utf8_dst_len-1, str_src.len());
    while last_index > 0 && !str_src.is_char_boundary(last_index) {
        last_index -= 1;
    }
    utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}
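The backward scan can also be looked at in isolation. The sketch below uses a hypothetical helper, `prefix_within`, which returns the longest prefix of a string that is no longer than `max_bytes` and ends on a character boundary. It relies on `is_char_boundary(0)` always being true, so the loop cannot run past the start.

```rust
/// Hypothetical helper, for illustration only: longest prefix of `s` that is
/// at most `max_bytes` long and ends on a character boundary.
pub fn prefix_within(s: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(s.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

A UTF-8 character is at most four bytes long, so the loop steps back at most three times, regardless of how long the string is.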


Based on Chris Emerson's answer and @Matthieu-m's suggestion to remove a redundant check:

// Returns the number of bytes written to `utf8_dst`, including the null terminator.
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    let utf8_dst_len = utf8_dst.len();
    if utf8_dst_len == 0 {
        return 0;
    }
    // truncate if 'str_src' is too long
    let mut last_index = str_src.len();
    if last_index >= utf8_dst_len {
        last_index = utf8_dst_len - 1;
        // no need to check last_index > 0 here,
        // is_char_boundary covers that case
        while !str_src.is_char_boundary(last_index) {
            last_index -= 1;
        }
    }
    utf8_dst[0..last_index].copy_from_slice(&str_src.as_bytes()[0..last_index]);
    utf8_dst[last_index] = 0;
    return last_index + 1;
}
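A usage sketch, assuming Rust 1.69 or later for `CStr::from_bytes_until_nul`. The function is repeated in compact form only so that the block compiles on its own; `read_back` is a hypothetical helper that recovers the text up to the terminator.

```rust
use std::ffi::CStr;

// Same logic as the function above, repeated so this block is self-contained.
pub fn strlcpy_utf8(utf8_dst: &mut [u8], str_src: &str) -> usize {
    if utf8_dst.is_empty() {
        return 0;
    }
    let mut last_index = str_src.len();
    if last_index >= utf8_dst.len() {
        last_index = utf8_dst.len() - 1;
        while !str_src.is_char_boundary(last_index) {
            last_index -= 1;
        }
    }
    utf8_dst[..last_index].copy_from_slice(&str_src.as_bytes()[..last_index]);
    utf8_dst[last_index] = 0;
    last_index + 1
}

/// Hypothetical helper, for illustration only: reads the null-terminated
/// text back out of `buf`, or returns `None` if there is no terminator or
/// the bytes before it are not valid UTF-8.
pub fn read_back(buf: &[u8]) -> Option<&str> {
    CStr::from_bytes_until_nul(buf).ok()?.to_str().ok()
}
```

Copying `"aé€"` into a 4-byte buffer writes `"aé"` plus the terminator and returns 4. With a 3-byte buffer only `"a"` fits, because `é` would leave no room for the terminator. A source string that itself contains a NUL would read back shortened, since the reader stops at the first zero byte.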

@ChrisEmerson: I'm posting this since it's the code I'm going to use for my project. Feel free to update your answer with these changes if you like, and I'll remove this answer.
