
Rust vs Python program performance results question

I wrote a program that counts words.

Here is the program:

use std::collections::HashMap;
use std::io;
use std::io::prelude::*;

#[derive(Debug)]
struct Entry {
    word: String,
    count: u32,
}

static SEPARATORS: &'static [char] = &[
    ' ', ',', '.', '!', '?', '\'', '"', '\n', '(', ')', '#', '{', '}', '[', ']', '-', ';', ':',
];

fn main() {
    if let Err(err) = try_main() {
        if err.kind() == std::io::ErrorKind::BrokenPipe {
            return;
        }
        // Ignore any error that may occur while writing to stderr.
        let _ = writeln!(std::io::stderr(), "{}", err);
    }
}

fn try_main() -> Result<(), std::io::Error> {
    let mut words: HashMap<String, u32> = HashMap::new();
    let stdin = io::stdin();
    for result in stdin.lock().lines() {
        let line = result?;
        line_processor(line, &mut words)
    }
    output(&mut words)?;
    Ok(())
}

fn line_processor(line: String, words: &mut HashMap<String, u32>) {
    let mut word = String::new();

    for c in line.chars() {
        if SEPARATORS.contains(&c) {
            add_word(word, words);
            word = String::new();
        } else {
            word.push_str(&c.to_string());
        }
    }
}

fn add_word(word: String, words: &mut HashMap<String, u32>) {
    if word.len() > 0 {
        if words.contains_key::<str>(&word) {
            words.insert(word.to_string(), words.get(&word).unwrap() + 1);
        } else {
            words.insert(word.to_string(), 1);
        }
        // println!("word >{}<", word.to_string())
    }
}

fn output(words: &mut HashMap<String, u32>) -> Result<(), std::io::Error> {
    let mut stack = Vec::<Entry>::new();

    for (k, v) in words {
        stack.push(Entry {
            word: k.to_string(),
            count: *v,
        });
    }

    stack.sort_by(|a, b| b.count.cmp(&a.count));
    stack.reverse();

    let stdout = io::stdout();
    let mut stdout = stdout.lock();
    while let Some(entry) = stack.pop() {
        writeln!(stdout, "{}\t{}", entry.count, entry.word)?;
    }
    Ok(())
}

It takes some arbitrary text file as input and counts words to produce output like:

15  the
14  in
11  are
10  and
10  of
9   species
9   bats
8   horseshoe
8   is
6   or
6   as
5   which
5   their

I compile it like this:

cargo build --release

I run it like this:

cat wiki-sample.txt | ./target/release/wordstats  | head -n 50

The wiki-sample.txt file I use is here

I compared the execution time with the Python (3.8) version, which is:

import sys
from collections import defaultdict

# import unidecode

seps = set(
    [
        " ",
        ",",
        ".",
        "!",
        "?",
        "'",
        '"',
        "\n",
        "(",
        ")",
        "#",
        "{",
        "}",
        "[",
        "]",
        "-",
        ";",
        ":",
    ]
)


def out(result):
    for i in result:
        print(f"{i[1]}\t{i[0]}")


if __name__ == "__main__":
    c = defaultdict(int)

    for line in sys.stdin:
        words = line.split(" ")
        for word in words:
            clean_word = []
            for char in word:
                if char not in seps and char:
                    clean_word.append(char)
            r = "".join(clean_word)
            # r = unidecode.unidecode(r)
            if r:
                c[r] += 1

    r = sorted(list(c.items()), key=lambda x: -x[1])
    try:
        out(r)
    except BrokenPipeError as e:
        pass

I run it like this:

cat /tmp/t.txt | ./venv/bin/python3 src/main.py | head -n 100
  • Average computation times are: Rust -> 5', Python 3.8 -> 19'
  • The Python version is (I think) less optimized (the split on the whole line requires an extra O(n) pass)
  • This is a single-threaded process, and quite a simple program
  • Most of the computing time is spent in the word-processing loop; output is almost instant.
  • I also removed the library code that removes accents, to stay closer to the standard libraries of both languages.

Question: Is it normal that Rust performs "only" ~3-4 times better?

I am also wondering if I am missing something here, because the computation time seems quite long for "only" 100 MB of data. I don't think (naively) that there is an approach with a lower big O for this, but I might be wrong.

I am used to comparing Python code with equivalent code in Go, Java or V, and I often see something like a 20x to 100x speed factor in these benchmarks.

Maybe CPython is good at this kind of processing, or maybe I am missing something in the Rust program (I am very new to Rust) that would make it more efficient.

I am afraid I am missing something big in my tests, but any thoughts about this?

Edit: following folks' advice, I now have the version below:

use std::collections::HashMap;
use std::io;
use std::io::prelude::*;

#[derive(Debug)]
struct Entry<'a> {
    word: &'a str, // word: String,
    count: u32,
}

static SEPARATORS: &'static [char] = &[
    ' ', ',', '.', '!', '?', '\'', '"', '\n', '(', ')', '#', '{', '}', '[', ']', '-', ';', ':',
];

fn main() {
    if let Err(err) = try_main() {
        if err.kind() == std::io::ErrorKind::BrokenPipe {
            return;
        }
        // Ignore any error that may occur while writing to stderr.
        let _ = writeln!(std::io::stderr(), "{}", err);
    }
}

fn try_main() -> Result<(), std::io::Error> {
    let mut words: HashMap<String, u32> = HashMap::new();
    let stdin = io::stdin();
    for result in stdin.lock().lines() {
        let line = result?;
        line_processor(line, &mut words)
    }
    output(&mut words)?;
    Ok(())
}

fn line_processor(line: String, words: &mut HashMap<String, u32>) {
    let mut l = line.as_str();
    loop {
        if let Some(pos) = l.find(|c: char| SEPARATORS.contains(&c)) {
            let (head, tail) = l.split_at(pos);
            add_word(head.to_owned(), words);
            l = &tail[1..];
        } else {
            break;
        }
    }
}

fn add_word(word: String, words: &mut HashMap<String, u32>) {
    if word.len() > 0 {
        let count = words.entry(word).or_insert(0);
        *count += 1;
    }
}

fn output(words: &mut HashMap<String, u32>) -> Result<(), std::io::Error> {
    let mut stack = Vec::<Entry>::new();

    for (k, v) in words {
        stack.push(Entry {
            word: k.as_str(), // word: k.to_string(),
            count: *v,
        });
    }

    stack.sort_by(|a, b| a.count.cmp(&b.count));

    let stdout = io::stdout();
    let mut stdout = stdout.lock();
    while let Some(entry) = stack.pop() {
        writeln!(stdout, "{}\t{}", entry.count, entry.word)?;
    }
    Ok(())
}

This now takes around 2.6' on my computer. That is way better, almost 10 times faster than the Python version, but still not quite what I expected (which is not a real problem). There might be some other optimisations that I do not have in mind for now.

You can go a bit faster by avoiding UTF-8 validation and making your search a bit smarter by using the bstr crate.

use std::io;
use std::io::prelude::*;

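// Note: this uses the external `bstr` and `fnv` crates, which need to be added to Cargo.toml.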
use bstr::{BStr, BString, io::BufReadExt, ByteSlice};

type HashMap<K, V> = fnv::FnvHashMap<K, V>;

#[derive(Debug)]
struct Entry<'a> {
    word: &'a BStr,
    count: u32,
}

static SEPSET: &'static [u8] = b" ,.!?'\"\n()#{}[]-;:";

fn main() {
    if let Err(err) = try_main() {
        if err.kind() == std::io::ErrorKind::BrokenPipe {
            return;
        }
        // Ignore any error that may occur while writing to stderr.
        let _ = writeln!(std::io::stderr(), "{}", err);
    }
}

fn try_main() -> Result<(), std::io::Error> {
    let mut words: HashMap<BString, u32> = HashMap::default();
    io::stdin().lock().for_byte_line(|line| {
        line_processor(line, &mut words);
        Ok(true)
    })?;
    output(&mut words)?;
    Ok(())
}

fn line_processor(mut line: &[u8], words: &mut HashMap<BString, u32>) {
    loop {
        if let Some(pos) = line.find_byteset(SEPSET) {
            let (head, tail) = line.split_at(pos);
            add_word(head, words);
            line = &tail[1..];
        } else {
            break;
        }
    }
}

fn add_word(word: &[u8], words: &mut HashMap<BString, u32>) {
    if word.len() > 0 {
        // The vast majority of the time we are looking
        // up a word that already exists, so don't bother
        // allocating in the common path. This means the
        // uncommon path does two lookups, but it's so
        // uncommon that the overall result is much faster.
        if let Some(count) = words.get_mut(word.as_bstr()) {
            *count += 1;
        } else {
            words.insert(BString::from(word), 1);
        }
    }
}

fn output(words: &mut HashMap<BString, u32>) -> Result<(), std::io::Error> {
    let mut stack = Vec::<Entry>::new();

    for (k, v) in words {
        stack.push(Entry {
            word: k.as_bstr(),
            count: *v,
        });
    }

    stack.sort_by(|a, b| a.count.cmp(&b.count));

    let stdout = io::stdout();
    let mut stdout = stdout.lock();
    while let Some(entry) = stack.pop() {
        writeln!(stdout, "{}\t{}", entry.count, entry.word)?;
    }
    Ok(())
}

At this point, most of the program's time is spent in the hashmap lookup. (Which is why I switched to fnv above.) So making it faster probably means using a different strategy for maintaining the map of words. My guess is that most words are only a couple of bytes long, so you could special-case those to use an array as a map instead of a hashmap. It could result in a substantial speedup, but it also complicates your original program a bit more.
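A minimal sketch of that idea, special-casing words of one or two bytes with directly indexed arrays and keeping a hashmap only as a fallback (the WordCounts type and its field names are just illustrative, not taken from the program above):

use std::collections::HashMap;

// Sketch: direct-indexed counters for 1- and 2-byte words,
// with a hashmap fallback for anything longer.
struct WordCounts {
    one: Vec<u32>,               // 256 counters, indexed by the single byte
    two: Vec<u32>,               // 256 * 256 counters, indexed by both bytes
    long: HashMap<Vec<u8>, u32>, // fallback for words of 3+ bytes
}

impl WordCounts {
    fn new() -> WordCounts {
        WordCounts {
            one: vec![0; 256],
            two: vec![0; 256 * 256],
            long: HashMap::new(),
        }
    }

    fn add(&mut self, word: &[u8]) {
        match word.len() {
            0 => {}
            1 => self.one[word[0] as usize] += 1,
            2 => self.two[(word[0] as usize) * 256 + word[1] as usize] += 1,
            _ => *self.long.entry(word.to_vec()).or_insert(0) += 1,
        }
    }
}

When producing the output you would then walk the two arrays, skip zero counts, rebuild the 1- and 2-byte words from their indices, and merge them with the hashmap entries.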

As to whether this speed is what one would expect, I would say, "it seems about right to me." Your program is taking an action on every word in a 14.5 million word document. The program above takes about 1.7 seconds on my machine, which means it's processing about 8.3 million words per second, or about 8.3 words per microsecond (roughly 120 ns per word). That seems about right given that every word does a hash lookup and requires a search to find the next word.

In add_word() you circumvent the borrowing problems by making new copies of word (.to_string()).

You could instead access the counter you want to increment just once, via the entry API:

let count = words.entry(word).or_insert(0);
*count += 1;

You could also avoid many string reallocations in line_processor() by working directly on the line as a &str:

let mut l = line.as_str();
loop {
    if let Some(pos) = l.find(|c: char| SEPARATORS.contains(&c)) {
        let (head, tail) = l.split_at(pos);
        add_word(head.to_owned(), words);
        l = &tail[1..];
    } else {
        break;
    }
}

When it comes to the output() function, new copies of the strings are made in order to initialise the Entry struct. We could change Entry to

#[derive(Debug)]
struct Entry<'a> {
    word: &'a str,  // word: String,
    count: u32,
}

and then work only on the &str slices of the original strings (in words).

stack.push(Entry {
    word: k.as_str(), // word: k.to_string(),
    count: *v,
});

Moreover, the in-place reverse of the sorted vector can be avoided if we invert the sorting criterion.

stack.sort_by(|a, b| a.count.cmp(&b.count));
// stack.reverse();

I guess these are the main bottlenecks in this example.

On my computer, timing with <wiki-sample.txt >/dev/null gives these speedups:

original -->  × 1 (reference)
using l.find()+l.split_at() --> × 1.48
using words.entry() --> × 1.25
using both l.find()+l.split_at() and words.entry() --> × 1.73
using all the preceding and &str in Entry and avoiding reverse --> × 2.05
