
How to speed up UTF-8 string processing

I am parsing tab-separated values:

pub fn parse_tsv(line: &str) -> MyType {
    for (i, value) in line.split('\t').enumerate() {
        // ...
    }
    // ...
}

perf top shows str::find. When I look at the generated assembly, there is a lot of work related to UTF-8 decoding of the characters in the &str.

And it is very slow: it takes 99% of the execution time.

But to find \t, it seems I can't simply search for the one-byte \t in a UTF-8 string.

What am I doing wrong? What is the Rust stdlib doing wrong?

Or maybe there is some Rust string library that can represent strings simply as u8 bytes, but still with all the split(), find(), and other methods?

As long as your string is ASCII, or you don't need to match on multi-byte UTF-8 scalars (as in your case, where you search for tabs), you can treat it as bytes with the as_bytes() method and operate on u8 bytes instead of chars (Unicode scalar values). This should be much faster. It is also correct here: in UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so searching for the single byte b'\t' cannot produce a false match in the middle of a multi-byte character. With &[u8], which is a slice, you can still use many of the methods you know from &str, such as split(), starts_with(), and contains(); slices have no find() for a single element, but iter().position() (or the memchr crate, for speed) covers that.

pub fn parse_tsv(line: &[u8]) {
    for (i, value) in line.split(|&b| b == b'\t').enumerate() {
        // ...
    }
}

fn main() {
    let line = String::new();
    let bytes = line.as_bytes();

    parse_tsv(bytes);
}
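If you still need &str fields after splitting (e.g. to call parse()), you can convert each byte slice back without re-validating anything by hand: because the split points are ASCII bytes, every field is still valid UTF-8. A minimal sketch (the helper name fields is mine, not from the original post):

```rust
use std::str;

/// Split a line on ASCII tab bytes and recover &str fields.
/// This is safe because in UTF-8, bytes below 0x80 (such as b'\t')
/// never appear inside a multi-byte sequence, so every cut point is
/// a character boundary and each field remains valid UTF-8.
pub fn fields(line: &str) -> Vec<&str> {
    line.as_bytes()
        .split(|&b| b == b'\t')
        // from_utf8 cannot fail here: each field is a slice of an
        // already-valid UTF-8 string, cut only at ASCII boundaries.
        .map(|f| str::from_utf8(f).unwrap())
        .collect()
}

fn main() {
    let line = "héllo\tworld\t42";
    println!("{:?}", fields(line));
}
```

The byte-level split avoids the char-decoding overhead you saw in the profile, and the from_utf8 calls only check validity of the (already valid) fields rather than decoding every scalar during the search.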
