
Hidden performance cost in Scala?

I came across this old question and ran the following experiment with Scala 2.10.3.

I rewrote the Scala version to use explicit tail recursion:

import scala.annotation.tailrec

object ScalaMain {
  private val t = 20

  private def run() {
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    println(i)
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i+1, a, b)
  }

  def main(args: Array[String]) {
    val t1 = System.currentTimeMillis()
    var i = 0
    while (i < 20) {
      run()
      i += 1
    }
    val t2 = System.currentTimeMillis()
    println("time: " + (t2 - t1))
  }
}

...and compared it with the following Java version. I deliberately made the methods non-static for a fair comparison with Scala:

public class JavaMain {
    private final int t = 20;

    private void run() {
        int i = 10;
        while (!isEvenlyDivisible(2, i, t))
            i += 2;
        System.out.println(i);
    }

    private boolean isEvenlyDivisible(int i, int a, int b) {
        if (i > b) return true;
        else return (a % i == 0) && isEvenlyDivisible(i+1, a, b);
    }

    public static void main(String[] args) {
        JavaMain o = new JavaMain();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 20; ++i)
          o.run();
        long t2 = System.currentTimeMillis();
        System.out.println("time: " + (t2 - t1));
    }
}

Here are the results on my computer:

> java JavaMain
....
time: 9651
> scala ScalaMain
....
time: 20592

This is Scala 2.10.3 on Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51.

My question: what is the hidden cost in the Scala version?

Many thanks.

Well, the OP's benchmark is not ideal. Many effects need to be mitigated, including warmup, dead-code elimination, forking, etc. Luckily, JMH already takes care of many of them and has bindings for both Java and Scala. Follow the procedures on the JMH page to get the benchmark project, then transplant the benchmarks below into it.

This is the sample Java benchmark:

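package org.sample;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class JavaBench {

    // JMH runs the benchmark once for each value of t
    @Param({"1", "5", "10", "15", "20"})
    int t;

    private int run() {
        int i = 10;
        while(!isEvenlyDivisible(2, i, t))
            i += 2;
        return i;
    }

    private boolean isEvenlyDivisible(int i, int a, int b) {
        if (i > b)
            return true;
        else
            return (a % i == 0) && isEvenlyDivisible(i + 1, a, b);
    }

    // Early JMH versions named this annotation @GenerateMicroBenchmark; current releases use @Benchmark.
    // Returning the result keeps the JIT from eliminating run() as dead code.
    @Benchmark
    public int test() {
        return run();
    }

}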

...and this is the sample Scala benchmark:

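package org.sample

import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._

import scala.annotation.tailrec

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
class ScalaBench {

  // JMH runs the benchmark once for each value of t
  @Param(Array("1", "5", "10", "15", "20"))
  var t: Int = _

  private def run(): Int = {
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    i
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i + 1, a, b)
  }

  // Early JMH versions named this annotation @GenerateMicroBenchmark; current releases use @Benchmark.
  @Benchmark
  def test(): Int = {
    run()
  }

}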

If you run these on JDK 8 GA, Linux x86_64, then you'll get:

Benchmark             (t)   Mode   Samples         Mean   Mean error    Units
o.s.ScalaBench.test     1   avgt        15        0.005        0.000    us/op
o.s.ScalaBench.test     5   avgt        15        0.489        0.001    us/op
o.s.ScalaBench.test    10   avgt        15       23.672        0.087    us/op
o.s.ScalaBench.test    15   avgt        15     3406.492        9.239    us/op
o.s.ScalaBench.test    20   avgt        15  2483221.694     5973.236    us/op

Benchmark            (t)   Mode   Samples         Mean   Mean error    Units
o.s.JavaBench.test     1   avgt        15        0.002        0.000    us/op
o.s.JavaBench.test     5   avgt        15        0.254        0.007    us/op
o.s.JavaBench.test    10   avgt        15       12.578        0.098    us/op
o.s.JavaBench.test    15   avgt        15     1628.694       11.282    us/op
o.s.JavaBench.test    20   avgt        15  1066113.157    11274.385    us/op

Notice that we vary t to see whether the effect is local to a particular value of t. It is not: the effect is systematic, with the Java version about twice as fast.
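
As a quick check of the "twice as fast" claim, the sketch below just divides the Scala means by the Java means from the table above (the numbers are copied verbatim, nothing is re-measured; the class name is illustrative). The ratios land between roughly 1.9 and 2.5; the t = 1 ratio is the least precise because both means are rounded to a few nanoseconds.

```java
class SpeedupFromTable {
    // Mean us/op copied from the JMH results above, indexed by position in T.
    static final int[] T = {1, 5, 10, 15, 20};
    static final double[] SCALA = {0.005, 0.489, 23.672, 3406.492, 2483221.694};
    static final double[] JAVA = {0.002, 0.254, 12.578, 1628.694, 1066113.157};

    // How many times slower the Scala benchmark is than the Java one for T[k].
    static double ratio(int k) {
        return SCALA[k] / JAVA[k];
    }

    public static void main(String[] args) {
        for (int k = 0; k < T.length; k++) {
            System.out.printf("t = %2d: Scala/Java = %.2f%n", T[k], ratio(k));
        }
    }
}
```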

PrintAssembly sheds some light on this. This is the hottest block in the Scala benchmark:

0x00007fe759199d42: test   %r8d,%r8d
0x00007fe759199d45: je     0x00007fe759199d76  ;*irem
                                               ; - org.sample.ScalaBench::isEvenlyDivisible@11 (line 52)
                                               ; - org.sample.ScalaBench::run@10 (line 45)
0x00007fe759199d47: mov    %ecx,%eax
0x00007fe759199d49: cmp    $0x80000000,%eax
0x00007fe759199d4e: jne    0x00007fe759199d58
0x00007fe759199d50: xor    %edx,%edx
0x00007fe759199d52: cmp    $0xffffffffffffffff,%r8d
0x00007fe759199d56: je     0x00007fe759199d5c
0x00007fe759199d58: cltd   
0x00007fe759199d59: idiv   %r8d

...and this is the similar block in Java:

0x00007f4a811848cf: movslq %ebp,%r10
0x00007f4a811848d2: mov    %ebp,%r9d
0x00007f4a811848d5: sar    $0x1f,%r9d
0x00007f4a811848d9: imul   $0x55555556,%r10,%r10
0x00007f4a811848e0: sar    $0x20,%r10
0x00007f4a811848e4: mov    %r10d,%r11d
0x00007f4a811848e7: sub    %r9d,%r11d         ;*irem
                                              ; - org.sample.JavaBench::isEvenlyDivisible@9 (line 63)
                                              ; - org.sample.JavaBench::isEvenlyDivisible@19 (line 63)
                                              ; - org.sample.JavaBench::run@10 (line 54)

Notice how, in the Java version, the compiler used the trick of translating an integer remainder into a multiplication followed by a right shift (see Hacker's Delight, Ch. 10, Sect. 19). This is possible only when the compiler detects that the remainder is computed against a constant, which suggests the Java version hit that sweet optimization but the Scala version did not. You can dig into the bytecode disassembly to figure out which scalac quirk intervened, but the point of this exercise is that surprisingly small differences in code generation are greatly magnified by benchmarks.
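
To make the trick concrete, here is a small Java sketch (class and method names are illustrative) of signed division and remainder by 3 using the multiplier 0x55555556 that appears in the Java listing; the divisor at that spot is presumably the constant 3, since the listing comes from the inlined second recursion level. It also prints Integer.MIN_VALUE % -1, which Java defines as 0. That corner case, together with a zero divisor, is what the extra comparisons in front of idiv in the Scala listing appear to guard against, because the raw x86 idiv instruction faults in both situations.

```java
class RemainderByConstant {
    // n / 3 without a divide instruction: take the high 32 bits of the 64-bit
    // product n * 0x55555556, then add 1 for negative n so the result rounds
    // toward zero like Java's '/'.
    static int divBy3(int n) {
        long product = (long) n * 0x55555556L;
        int high = (int) (product >> 32);   // arithmetic shift keeps the sign
        return high - (n >> 31);            // n >> 31 is -1 for negative n, else 0
    }

    // n % 3 recovered from the quotient: n - 3 * q
    static int remBy3(int n) {
        return n - 3 * divBy3(n);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 3, 10, 20, -1, -3, -10, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            System.out.printf("%d %% 3 = %d (magic: %d)%n", n, n % 3, remBy3(n));
        }
        // Java defines MIN_VALUE % -1 as 0, so the generic path needs a special case before idiv.
        System.out.println(Integer.MIN_VALUE % -1);
    }
}
```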

P.S. So much for @tailrec...

UPDATE: A more thorough explanation of the effect: http://shipilev.net/blog/2014/java-scala-divided-we-fail/

I changed the val

private val t = 20

to a constant definition

private final val t = 20

and got a significant performance boost; now both versions seem to perform almost equally [on my system, see the update and comments].

I have not studied the bytecode, but if you use val t = 20, javap shows that there is an accessor method (and that version is as slow as the one with private val).

So I assume that even a private val involves calling a method, and that's not directly comparable with a final in Java.
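
For comparison on the Java side: in JavaMain, private final int t = 20 is a constant variable in the sense of JLS 4.12.4 (a final primitive initialized with a constant expression), so javac substitutes the literal 20 wherever t is read. The minimal sketch below (class name is illustrative) shows this: such a field is even accepted as a switch case label, which requires a compile-time constant. Scala's counterpart is a constant value definition, final val t = 20 without a type ascription, which is the change made above.

```java
class ConstantVariableDemo {
    // A final primitive field initialized with a constant expression is a
    // constant variable (JLS 4.12.4), just like JavaMain's t; javac folds reads of it to 20.
    private final int t = 20;

    // Case labels must be compile-time constants, so this only compiles because t is one.
    int classify(int x) {
        switch (x) {
            case t:
                return 1;
            default:
                return 0;
        }
    }

    public static void main(String[] args) {
        ConstantVariableDemo demo = new ConstantVariableDemo();
        System.out.println(demo.classify(20) + " " + demo.classify(19)); // prints "1 0"
    }
}
```

Dropping final from the field makes the case label a compile error, which is a quick way to see the difference.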

Update

On my system I got these results:

Java version: time: 14725

Scala version: time: 13228

Using OpenJDK 1.7 on 32-bit Linux.

In my experience, Oracle's JDK on a 64-bit system does perform better, which probably explains why other measurements yield even better results in favour of the Scala version.

As for the Scala version performing better, I assume that tail-recursion optimization does have an effect here (see Phil's answer: if the Java version is rewritten to use a loop instead of recursion, it performs equally again).
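
The loop the Scala version benefits from can be written out by hand. scalac rewrites a self-recursive call in tail position into a jump back to the start of the method when the method cannot be overridden (private, final, or a member of an object); @tailrec only makes compilation fail if that rewrite is impossible. javac performs no such rewrite. The Java sketch below (names are illustrative) puts the recursive form used in JavaMain next to the loop that rewrite roughly corresponds to; because && short-circuits, the recursive call is in tail position.

```java
class TailRecAsLoop {
    // Plain recursion, as in JavaMain: every divisor checked costs another call.
    static boolean isEvenlyDivisibleRec(int i, int a, int b) {
        if (i > b) return true;
        return (a % i == 0) && isEvenlyDivisibleRec(i + 1, a, b);
    }

    // The same function with the tail call replaced by "rebind i, jump to the top",
    // roughly the shape scalac produces for the @tailrec method.
    static boolean isEvenlyDivisibleLoop(int i, int a, int b) {
        while (true) {
            if (i > b) return true;
            if (a % i != 0) return false;
            i = i + 1;
        }
    }

    // Mirrors run(): the first even number >= 10 divisible by every value in 2..t.
    static int firstEvenlyDivisible(int t) {
        int i = 10;
        while (!isEvenlyDivisibleLoop(2, i, t))
            i += 2;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(firstEvenlyDivisible(10)); // prints 2520
    }
}
```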

I looked into this and edited the Scala version to declare t inside run:

object ScalaMain {
  private def run() {
    val t = 20
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    println(i)
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i+1, a, b)
  }

  def main(args: Array[String]) {
    val t1 = System.currentTimeMillis()
    var i = 0
    while (i < 20) {
      run()
      i += 1
    }
    val t2 = System.currentTimeMillis()
    println("time: " + (t2 - t1))
  }
}

The new Scala version now runs twice as fast as the original Java one:

> fsc ScalaMain.scala
> scala ScalaMain
....
time: 6373
> fsc -optimize ScalaMain.scala
....
time: 4703

I figured out that this is because Java does not have tail-call elimination. An optimized Java version that uses a loop instead of recursion runs just as fast:

public class JavaMain {
    private static final int t = 20;

    private void run() {
        int i = 10;
        while (!isEvenlyDivisible(i, t))
            i += 2;
        System.out.println(i);
    }

    private boolean isEvenlyDivisible(int a, int b) {
        for (int i = 2; i <= b; ++i) {
            if (a % i != 0)
                 return false;
        }
        return true;
    }

    public static void main(String[] args) {
        JavaMain o = new JavaMain();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 20; ++i)
            o.run();
        long t2 = System.currentTimeMillis();
        System.out.println("time: " + (t2 - t1));
    }
}

Now my confusion is fully resolved:

> java JavaMain
....
time: 4795

In conclusion, the original Scala version was slow because I didn't declare t to be final (directly or indirectly, as Beryllium's answer points out), and the original Java version was slow due to the lack of tail-call elimination.

To make the Java version completely equivalent to your Scala code, you need to change it like this:

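// Inside JavaMain: a plain field plus a private accessor, mirroring the
// accessor method scalac generates for `private val t = 20`.
private int t = 20;

private int t() {
    return this.t;
}

private void run() {
    int i = 10;
    while (!isEvenlyDivisible(2, i, t()))   // read t through the accessor, as the Scala code does
        i += 2;
    System.out.println(i);
}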

It is slower because the JVM cannot optimize away these method calls.

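Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original URL. For any questions, contact: yoyou2525@163.com.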

 
Guangdong ICP registration No. 18138465  © 2020-2024 STACKOOM.COM