Reading the Zero-cost abstractions and looking at Introduction to rust: a low-level language with high-level abstractions I tried to compare two approaches to computing the dot product of a vector: one using a for loop and one using iterators.
#![feature(test)]
extern crate rand;
extern crate test;
use std::cmp::min;
fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
return result;
}
fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}
#[cfg(test)]
mod bench {
use test::Bencher;
use rand::{Rng,thread_rng};
use super::*;
const LEN: usize = 30;
#[test]
fn test_1() {
let x = [1.0, 2.0, 3.0];
let y = [2.0, 4.0, 6.0];
let result = dot_product_1(&x, &y);
assert_eq!(result, 28.0);
}
#[test]
fn test_2() {
let x = [1.0, 2.0, 3.0];
let y = [2.0, 4.0, 6.0];
let result = dot_product_2(&x, &y);
assert_eq!(result, 28.0);
}
fn rand_array(cnt: u32) -> Vec<f64> {
let mut rng = thread_rng();
(0..cnt).map(|_| rng.gen::<f64>()).collect()
}
#[bench]
fn bench_small_1(b: &mut Bencher) {
let samples = rand_array(2*LEN as u32);
b.iter(|| {
dot_product_1(&samples[0..LEN], &samples[LEN..2*LEN])
})
}
#[bench]
fn bench_small_2(b: &mut Bencher) {
let samples = rand_array(2*LEN as u32);
b.iter(|| {
dot_product_2(&samples[0..LEN], &samples[LEN..2*LEN])
})
}
}
The later of the links above claims that the version with the iterators should have similar performance "and actually be a little bit faster". However, when benchmarking the two, I get very different results:
running 2 tests
test bench::bench_small_loop ... bench: 20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 24 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
So, where did the "zero-cost abstraction" go?
Update: Adding the foldr
example provided by @wimh and using split_at
instead of slices give the following result.
running 3 tests
test bench::bench_small_fold ... bench: 18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench: 21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench: 24 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
So it seems that the additional time comes directly or indirectly from constructing the slices inside the measured code. To check that it indeed was the case, I tried the following two approaches with the same result (here shown for foldr
case and using map
+ sum
):
#[bench]
fn bench_small_iter(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let s0 = &samples[0..LEN];
let s1 = &samples[LEN..2 * LEN];
b.iter(|| dot_product_iter(s0, s1))
}
#[bench]
fn bench_small_fold(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_fold(s0, s1))
}
It seems zero cost to me. I wrote your code slightly more idiomatically, using the same random values for both tests, and then tested multiple times:
fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
result
}
fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: usize) -> Vec<f64> {
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}
#[bench]
fn bench_small_1(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_1(s0, s1))
}
#[bench]
fn bench_small_2(b: &mut Bencher) {
let samples = rand_array(2 * LEN);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_2(s0, s1))
}
bench_small_1 20 ns/iter (+/- 6)
bench_small_2 17 ns/iter (+/- 1)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 17 ns/iter (+/- 2)
bench_small_1 19 ns/iter (+/- 5)
bench_small_2 17 ns/iter (+/- 3)
bench_small_1 19 ns/iter (+/- 3)
bench_small_2 24 ns/iter (+/- 7)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 24 ns/iter (+/- 1)
bench_small_1 27 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 0)
bench_small_2 25 ns/iter (+/- 1)
bench_small_1 28 ns/iter (+/- 1)
bench_small_2 17 ns/iter (+/- 1)
In 9 of the 10 runs, the idiomatic code was faster than the for loop. This was done on 2.9 GHz Core i9 (I9-8950HK) with 32 GB RAM, compiled with rustc 1.31.0-nightly (fc403ad98 2018-09-30)
.
For fun, I rewrote the benchmark to use criterion , a port of the Haskell benchmarking library.
Cargo.toml
[package]
name = "mats-zero-cost-rust"
version = "0.1.0"
authors = ["mats"]
[dev-dependencies]
criterion = "0.2"
rand = "0.4"
[[bench]]
name = "my_benchmark"
harness = false
benches/my_benchmark.rs
#[macro_use]
extern crate criterion;
extern crate rand;
use std::cmp::min;
use criterion::Criterion;
use rand::{thread_rng, Rng};
const LEN: usize = 30;
fn dot_product_loop(x: &[f64], y: &[f64]) -> f64 {
let mut result: f64 = 0.0;
for i in 0..min(x.len(), y.len()) {
result += x[i] * y[i];
}
return result;
}
fn dot_product_iter(x: &[f64], y: &[f64]) -> f64 {
x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: u32) -> Vec<f64> {
let mut rng = thread_rng();
(0..cnt).map(|_| rng.gen()).collect()
}
fn criterion_loop_with_slice(c: &mut Criterion) {
c.bench_function("loop with slice", |b| {
let samples = rand_array(2 * LEN as u32);
b.iter(|| dot_product_loop(&samples[0..LEN], &samples[LEN..2 * LEN]))
});
}
fn criterion_loop_without_slice(c: &mut Criterion) {
c.bench_function("loop without slice", |b| {
let samples = rand_array(2 * LEN as u32);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_loop(s0, s1))
});
}
fn criterion_iter_with_slice(c: &mut Criterion) {
c.bench_function("iterators with slice", |b| {
let samples = rand_array(2 * LEN as u32);
b.iter(|| dot_product_iter(&samples[0..LEN], &samples[LEN..2 * LEN]))
});
}
fn criterion_iter_without_slice(c: &mut Criterion) {
c.bench_function("iterators without slice", |b| {
let samples = rand_array(2 * LEN as u32);
let (s0, s1) = samples.split_at(LEN);
b.iter(|| dot_product_iter(s0, s1))
});
}
criterion_group!(benches, criterion_loop_with_slice, criterion_loop_without_slice, criterion_iter_with_slice, criterion_iter_without_slice);
criterion_main!(benches);
I observe these results;
kolmodin@blin:~/code/mats-zero-cost-rust$ cargo bench
Compiling mats-zero-cost-rust v0.1.0 (/home/kolmodin/code/mats-zero-cost-rust)
Finished release [optimized] target(s) in 1.16s
Running target/release/deps/my_benchmark-6f00e042fd40bc1d
Gnuplot not found, disabling plotting
loop with slice time: [7.5794 ns 7.6843 ns 7.8016 ns]
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) high mild
13 (13.00%) high severe
loop without slice time: [24.384 ns 24.486 ns 24.589 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low severe
1 (1.00%) low mild
iterators with slice time: [13.842 ns 13.852 ns 13.863 ns]
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
4 (4.00%) high mild
1 (1.00%) high severe
iterators without slice time: [13.420 ns 13.424 ns 13.430 ns]
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
Gnuplot not found, disabling plotting
Using rustc 1.30.0 (da5f414c2 2018-10-24)
on an AMD Ryzen 7 2700X.
The iterator implementation gets similar results for using slice
and split_at
.
The loop implementation gets very different results. The version with slice is significantly faster.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.