
How can I improve low performance in parallel code in vanilla Rust?

I am working on an Exercism code exercise in which you have to count the characters in a slice of strings in parallel. The code comes with some benchmarks to compare parallel vs. sequential performance.

I added a constraint: no threading libraries (crossbeam, rayon, etc.), just vanilla Rust.

So far I have come up with this:

#![feature(test)]
extern crate test;

pub mod parallel_letter_frequency {
    use std::collections::HashMap;
    use std::thread;

    const MIN_CHUNCK_SIZE: usize = 15;

    pub fn string_len<T: AsRef<str>>(strings: &[T]) -> HashMap<char, usize> {
        let mut dic = HashMap::new();
        for string in strings {
            for c in string
                .as_ref()
                .to_lowercase()
                .chars()
                .filter(|c| c.is_alphabetic())
            {
                *dic.entry(c).or_insert(0) += 1;
            }
        }
        dic
    }

    pub fn frequency(input: &[&str], worker_count: usize) -> HashMap<char, usize> {
        let chunk_size = input.len() / worker_count;

        match (worker_count, chunk_size) {
            (w, c) if w == 1 || c < MIN_CHUNCK_SIZE => string_len(input),
            _ => input
                .chunks(chunk_size)
                .map(|chunk| {
                    let vals = chunk.iter().map(|s| s.to_string()).collect::<Vec<_>>();
                    thread::spawn(move || -> HashMap<char, usize> { string_len(&vals) })
                })
                .map(|child| child.join().unwrap())
                .fold(HashMap::new(), |mut acc, val| {
                    val.iter()
                        .for_each(|(k, v)| *(acc).entry(*k).or_insert(0) += v);
                    acc
                }),
        }
    }
}

#[cfg(test)]
mod tests {
    use crate::parallel_letter_frequency;
    use std::collections::HashMap;
    use test::Bencher;

    #[bench]
    fn bench_tiny_parallel(b: &mut Bencher) {
        let tiny = &["a"];
        b.iter(|| parallel_letter_frequency::frequency(tiny, 3));
    }

    #[bench]
    fn bench_tiny_sequential(b: &mut Bencher) {
        let tiny = &["a"];
        b.iter(|| frequency(tiny));
    }

    #[bench]
    fn bench_small_parallel(b: &mut Bencher) {
        let texts = all_texts(1);
        b.iter(|| parallel_letter_frequency::frequency(&texts, 3));
    }

    #[bench]
    fn bench_small_sequential(b: &mut Bencher) {
        let texts = all_texts(1);
        b.iter(|| frequency(&texts));
    }

    #[bench]
    fn bench_large_parallel(b: &mut Bencher) {
        let texts = all_texts(30);
        b.iter(|| parallel_letter_frequency::frequency(&texts, 3));
    }

    #[bench]
    fn bench_large_sequential(b: &mut Bencher) {
        let texts = all_texts(30);
        b.iter(|| frequency(&texts));
    }

    /// Simple sequential char frequency. Can it be beat?
    pub fn frequency(texts: &[&str]) -> HashMap<char, usize> {
        let mut map = HashMap::new();

        for line in texts {
            for chr in line.chars().filter(|c| c.is_alphabetic()) {
                if let Some(c) = chr.to_lowercase().next() {
                    (*map.entry(c).or_insert(0)) += 1;
                }
            }
        }

        map
    }

    fn all_texts(repeat: usize) -> Vec<&'static str> {
        [ODE_AN_DIE_FREUDE, WILHELMUS, STAR_SPANGLED_BANNER]
            .iter()
            .cycle()
            .take(3 * repeat)
            .flat_map(|anthem| anthem.iter().cloned())
            .collect()
    }

    // Poem by Friedrich Schiller. The corresponding music is the European Anthem.
    pub const ODE_AN_DIE_FREUDE: [&'static str; 8] = [
        "Freude schöner Götterfunken",
        "Tochter aus Elysium,",
        "Wir betreten feuertrunken,",
        "Himmlische, dein Heiligtum!",
        "Deine Zauber binden wieder",
        "Was die Mode streng geteilt;",
        "Alle Menschen werden Brüder,",
        "Wo dein sanfter Flügel weilt.",
    ];

    // Dutch national anthem
    pub const WILHELMUS: [&'static str; 8] = [
        "Wilhelmus van Nassouwe",
        "ben ik, van Duitsen bloed,",
        "den vaderland getrouwe",
        "blijf ik tot in den dood.",
        "Een Prinse van Oranje",
        "ben ik, vrij, onverveerd,",
        "den Koning van Hispanje",
        "heb ik altijd geëerd.",
    ];

    // American national anthem
    pub const STAR_SPANGLED_BANNER: [&'static str; 8] = [
        "O say can you see by the dawn's early light,",
        "What so proudly we hailed at the twilight's last gleaming,",
        "Whose broad stripes and bright stars through the perilous fight,",
        "O'er the ramparts we watched, were so gallantly streaming?",
        "And the rockets' red glare, the bombs bursting in air,",
        "Gave proof through the night that our flag was still there;",
        "O say does that star-spangled banner yet wave,",
        "O'er the land of the free and the home of the brave?",
    ];
}

And here are my benchmarks:

test bench_large_parallel   ... bench:     851,675 ns/iter (+/- 92,416)
test bench_large_sequential ... bench:     839,470 ns/iter (+/- 52,717)
test bench_small_parallel   ... bench:      22,488 ns/iter (+/- 5,062)
test bench_small_sequential ... bench:      28,692 ns/iter (+/- 1,406)
test bench_tiny_parallel    ... bench:          76 ns/iter (+/- 3)
test bench_tiny_sequential  ... bench:          66 ns/iter (+/- 3)

As you can see, the performance of the sequential and parallel versions is pretty similar...

I have to copy each chunk before passing it to the thread so that I don't run into 'static lifetime issues; that surely impacts performance. How can I address it?

I tried to address thread-spawn overhead by counting small inputs on the main thread. Is that good practice?

What else is impacting performance here? What am I doing wrong?

This is not a full or particularly nice answer, just a few random things I noticed. Maybe it helps you:

  • Due to how iterators work, you are always just creating a thread and immediately joining it again. Try executing this code (Playground):

     (0..3)
         .map(|i| { println!("spawning {}", i); i })
         .map(|i| { println!("joining {}", i); i })
         .last();

    The output will probably surprise you. You can read a bit about iterator laziness here, but unfortunately it doesn't explain this particular situation.

  • Your parallel and sequential benchmarks don't do the same thing. Parallel:

     for c in string.as_ref().to_lowercase().chars().filter(|c| c.is_alphabetic()) { *dic.entry(c).or_insert(0) += 1; }

    Sequential:顺序:

     for chr in line.chars().filter(|c| c.is_alphabetic()) { if let Some(c) = chr.to_lowercase().next() { (*map.entry(c).or_insert(0)) += 1; } }

    In particular, the parallel one calls to_lowercase on a str , returning a String and thus doing at least one heap allocation. That's a big no-no. Furthermore, in the sequential case you only use the first character of the lowercased result. That's simply doing something completely different. It might not matter for your datasets, but just imagine if Mr. Schiller had used o plus U+0308 ◌̈ COMBINING DIAERESIS instead of U+00F6 LATIN SMALL LETTER O WITH DIAERESIS ! Then you would be in deep trouble.

  • Why the strange check on chunk_size and worker_count ? If I'm doing the math right, for small and tiny you don't even spawn multiple threads because of that check. So the parallel in those benchmarks' names is a lie.

  • 800 µs per iteration is not a lot. Creating threads takes time. Together with all the heap allocations you do to give each thread its own data, the advantage from multithreading is not very high.

  • As you said, lots of heap allocation. Most of it could be avoided by using crossbeam::scoped , but since you are not allowed to use that, the next best bet is to put all the data into an Arc and give each thread a range to work on. Since everything (apart from the per-thread hash map) is immutable, sharing these things should be easy.

There are probably more things to improve, but these are the ones I noticed. I hope this helps!
