
How can I improve low performance in parallel code in vanilla Rust?

I am working on an exercism code exercise in which you have to count characters in a slice of strings in parallel. The code comes with some benchmarks to compare parallel vs. sequential performance.

I added a constraint of not using any threading library (crossbeam, rayon, etc.), just vanilla Rust.

So far I have come up with this:

#![feature(test)]
extern crate test;

pub mod parallel_letter_frequency {
    use std::collections::HashMap;
    use std::thread;

    const MIN_CHUNCK_SIZE: usize = 15;

    pub fn string_len<T: AsRef<str>>(strings: &[T]) -> HashMap<char, usize> {
        let mut dic = HashMap::new();
        for string in strings {
            for c in string
                .as_ref()
                .to_lowercase()
                .chars()
                .filter(|c| c.is_alphabetic())
            {
                *dic.entry(c).or_insert(0) += 1;
            }
        }
        dic
    }

    pub fn frequency(input: &[&str], worker_count: usize) -> HashMap<char, usize> {
        let chunk_size = input.len() / worker_count;

        match (worker_count, chunk_size) {
            (w, c) if w == 1 || c < MIN_CHUNCK_SIZE => string_len(input),
            _ => input
                .chunks(chunk_size)
                .map(|chunk| {
                    let vals = chunk.iter().map(|s| s.to_string()).collect::<Vec<_>>();
                    thread::spawn(move || -> HashMap<char, usize> { string_len(&vals) })
                })
                .map(|child| child.join().unwrap())
                .fold(HashMap::new(), |mut acc, val| {
                    val.iter()
                        .for_each(|(k, v)| *(acc).entry(*k).or_insert(0) += v);
                    acc
                }),
        }
    }
}

#[cfg(test)]
mod tests {
    use crate::parallel_letter_frequency;
    use std::collections::HashMap;
    use test::Bencher;

    #[bench]
    fn bench_tiny_parallel(b: &mut Bencher) {
        let tiny = &["a"];
        b.iter(|| parallel_letter_frequency::frequency(tiny, 3));
    }

    #[bench]
    fn bench_tiny_sequential(b: &mut Bencher) {
        let tiny = &["a"];
        b.iter(|| frequency(tiny));
    }

    #[bench]
    fn bench_small_parallel(b: &mut Bencher) {
        let texts = all_texts(1);
        b.iter(|| parallel_letter_frequency::frequency(&texts, 3));
    }

    #[bench]
    fn bench_small_sequential(b: &mut Bencher) {
        let texts = all_texts(1);
        b.iter(|| frequency(&texts));
    }

    #[bench]
    fn bench_large_parallel(b: &mut Bencher) {
        let texts = all_texts(30);
        b.iter(|| parallel_letter_frequency::frequency(&texts, 3));
    }

    #[bench]
    fn bench_large_sequential(b: &mut Bencher) {
        let texts = all_texts(30);
        b.iter(|| frequency(&texts));
    }

    /// Simple sequential char frequency. Can it be beat?
    pub fn frequency(texts: &[&str]) -> HashMap<char, usize> {
        let mut map = HashMap::new();

        for line in texts {
            for chr in line.chars().filter(|c| c.is_alphabetic()) {
                if let Some(c) = chr.to_lowercase().next() {
                    (*map.entry(c).or_insert(0)) += 1;
                }
            }
        }

        map
    }

    fn all_texts(repeat: usize) -> Vec<&'static str> {
        [ODE_AN_DIE_FREUDE, WILHELMUS, STAR_SPANGLED_BANNER]
            .iter()
            .cycle()
            .take(3 * repeat)
            .flat_map(|anthem| anthem.iter().cloned())
            .collect()
    }

    // Poem by Friedrich Schiller. The corresponding music is the European Anthem.
    pub const ODE_AN_DIE_FREUDE: [&'static str; 8] = [
        "Freude schöner Götterfunken",
        "Tochter aus Elysium,",
        "Wir betreten feuertrunken,",
        "Himmlische, dein Heiligtum!",
        "Deine Zauber binden wieder",
        "Was die Mode streng geteilt;",
        "Alle Menschen werden Brüder,",
        "Wo dein sanfter Flügel weilt.",
    ];

    // Dutch national anthem
    pub const WILHELMUS: [&'static str; 8] = [
        "Wilhelmus van Nassouwe",
        "ben ik, van Duitsen bloed,",
        "den vaderland getrouwe",
        "blijf ik tot in den dood.",
        "Een Prinse van Oranje",
        "ben ik, vrij, onverveerd,",
        "den Koning van Hispanje",
        "heb ik altijd geëerd.",
    ];

    // American national anthem
    pub const STAR_SPANGLED_BANNER: [&'static str; 8] = [
        "O say can you see by the dawn's early light,",
        "What so proudly we hailed at the twilight's last gleaming,",
        "Whose broad stripes and bright stars through the perilous fight,",
        "O'er the ramparts we watched, were so gallantly streaming?",
        "And the rockets' red glare, the bombs bursting in air,",
        "Gave proof through the night that our flag was still there;",
        "O say does that star-spangled banner yet wave,",
        "O'er the land of the free and the home of the brave?",
    ];
}

and here are my benchmarks:

test bench_large_parallel   ... bench:     851,675 ns/iter (+/- 92,416)
test bench_large_sequential ... bench:     839,470 ns/iter (+/- 52,717)
test bench_small_parallel   ... bench:      22,488 ns/iter (+/- 5,062)
test bench_small_sequential ... bench:      28,692 ns/iter (+/- 1,406)
test bench_tiny_parallel    ... bench:          76 ns/iter (+/- 3)
test bench_tiny_sequential  ... bench:          66 ns/iter (+/- 3)

As you can see, the performance of the sequential and parallel versions is pretty similar...

I have to copy each chunk before I pass it to the thread so I don't run into the 'static lifetime issue; that surely impacts performance. How can I address that?

I tried to address thread-spawn overhead by counting small inputs on the main thread; is that good practice?

What else is impacting performance here? What am I doing wrong?

This is not a full or particularly nice answer, but only a few random things I noticed. Maybe it helps you:

  • Due to how iterators work, you are always just creating a thread and immediately joining it again. Please try executing this code (Playground):

     (0..3).map(|i| { println!("spawning {}", i); i }).map(|i| { println!("joining {}", i); i }).last();

    The output will probably surprise you. You can read a bit about iterator laziness here, but it doesn't explain this particular situation, unfortunately. One way to fix it, collecting the JoinHandles into a Vec before joining any of them, is sketched after this list.

  • Your parallel and sequential benchmarks don't do the same thing. Parallel:

     for c in string.as_ref().to_lowercase().chars().filter(|c| c.is_alphabetic()) { *dic.entry(c).or_insert(0) += 1; }

    Sequential:

     for chr in line.chars().filter(|c| c.is_alphabetic()) { if let Some(c) = chr.to_lowercase().next() { (*map.entry(c).or_insert(0)) += 1; } }

    In particular, the parallel one calls to_lowercase on a str, returning a String and thus doing at least one heap allocation. That's a big no-no. Furthermore, in the sequential case you only use the first character of the "lowercased" character. That's simply doing something completely different. It might not matter for your datasets, but just imagine if Mr. Schiller had used o plus U+0308 ◌̈ COMBINING DIAERESIS instead of U+00F6 LATIN SMALL LETTER O WITH DIAERESIS! Then you would be in deep trouble. An allocation-free way to lowercase character by character is also sketched after this list.

  • Why the strange check of chunk_size and worker_count? If I'm doing the math right, for small and tiny you are not even spawning multiple threads because of that check. So the "parallel" in those benchmarks' names is a lie.

  • 800µs per iteration is not a lot. Creating threads takes time. Together with all the heap allocations you do to give each thread its own data, the advantage from multithreading is not very high.

  • As you said, lots of heap allocation. Most of it can be avoided by using crossbeam::scoped, but since you are not allowed to use that, the next best bet would be to put all data into an Arc and give each thread a range to work in. Since everything (apart from the hash map per thread) is immutable, sharing these things should be easy. A rough sketch of this is below as well.
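
To illustrate the first point, here is a minimal sketch of a reworked frequency (assuming the poster's string_len is in scope and worker_count is at least 1): the JoinHandles are collected into a Vec before any of them is joined, so all threads actually run at the same time.

    use std::collections::HashMap;
    use std::thread;

    pub fn frequency(input: &[&str], worker_count: usize) -> HashMap<char, usize> {
        // Guard against chunks(0), which panics when worker_count > input.len().
        let chunk_size = (input.len() / worker_count).max(1);

        // Spawn all threads first by collecting the JoinHandles into a Vec;
        // chaining .map(spawn).map(join) on a lazy iterator spawns and joins
        // one thread at a time, serializing the work.
        let handles: Vec<_> = input
            .chunks(chunk_size)
            .map(|chunk| {
                // Still copies the chunk; see the Arc sketch below for reducing that.
                let vals: Vec<String> = chunk.iter().map(|s| s.to_string()).collect();
                thread::spawn(move || string_len(&vals))
            })
            .collect();

        // Only now join, merging each thread's map into the accumulator.
        let mut acc = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *acc.entry(k).or_insert(0) += v;
            }
        }
        acc
    }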
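
For the second point, lowercasing can be done per character without allocating a String for every line: char::to_lowercase returns an iterator, so the characters it yields can be counted directly (which also covers lowercasings that expand to more than one character). A sketch, with an illustrative function name:

    use std::collections::HashMap;

    fn count_letters(lines: &[&str]) -> HashMap<char, usize> {
        let mut map = HashMap::new();
        for line in lines {
            // Filter first, then lowercase char by char; no intermediate String.
            for c in line
                .chars()
                .filter(|c| c.is_alphabetic())
                .flat_map(char::to_lowercase)
            {
                *map.entry(c).or_insert(0) += 1;
            }
        }
        map
    }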
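
And for the last point, a rough sketch of the Arc idea (same assumptions as above): the input is copied once into an Arc<Vec<String>>, and each thread gets a clone of the Arc plus an index range instead of its own copy of the chunk.

    use std::collections::HashMap;
    use std::sync::Arc;
    use std::thread;

    pub fn frequency_arc(input: &[&str], worker_count: usize) -> HashMap<char, usize> {
        // One copy of the data, shared by all threads.
        let data: Arc<Vec<String>> = Arc::new(input.iter().map(|s| s.to_string()).collect());
        let chunk_size = (data.len() + worker_count - 1) / worker_count;

        let handles: Vec<_> = (0..worker_count)
            .filter_map(|i| {
                let start = i * chunk_size;
                if start >= data.len() {
                    return None; // fewer chunks than workers
                }
                let end = (start + chunk_size).min(data.len());
                let data = Arc::clone(&data);
                // The spawned thread borrows its slice out of the shared Vec.
                Some(thread::spawn(move || string_len(&data[start..end])))
            })
            .collect();

        let mut acc = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *acc.entry(k).or_insert(0) += v;
            }
        }
        acc
    }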

There are probably more things to improve, but those are the ones I noticed for now. I hope this helps!
