scala implicit performance

Question

This comes up regularly. Functions coded up using generics are signifficnatly slower in scala. See example below. Type specific version performs about a 1/3 faster than the generic version. This is doubly surprising given that the generic component is outside of the expensive loop. Is there a known explanation for this?

  def xxxx_flttn[T](v: Array[Array[T]])(implicit m: Manifest[T]): Array[T] = {
    val I = v.length
    if (I <= 0) Array.ofDim[T](0)
    else {
      val J = v(0).length
      for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
      val flt = Array.ofDim[T](I * J)
      for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
      flt
    }
  }
  def flttn(v: Array[Array[Double]]): Array[Double] = {
    val I = v.length
    if (I <= 0) Array.ofDim[Double](0)
    else {
      val J = v(0).length
      for (i <- 1 until I) if (v(i).length != J) throw new utl_err("2D matrix not symetric. cannot be flattened. first row has " + J + " elements. row " + i + " has " + v(i).length)
      val flt = Array.ofDim[Double](I * J)
      for (i <- 0 until I; j <- 0 until J) flt(i * J + j) = v(i)(j)
      flt
    }
  }

Answer 1

This is due to boxing, when you apply the generic to a primitive type and use containing arrays (or the type appearing plain in method signatures or as member).

Example

In the following trait, after compilation, the process method will take an erased Array[Any] .

trait Foo[A]{
  def process(as: Array[A]): Int
}

If you choose A to be a value/primitive type, like Double it has to be boxed. When writing the trait in a non-generic way (eg with A=Double ), process is compiled to take an Array[Double] , which is a distinct array type on the JVM. This is more efficient, since in order to store a Double inside the Array[Any] , the Double has to be wrapped (boxed) into an object, a reference to which gets stored inside the array. The special Array[Double] can store the Double directly in memory as a 64-Bit value.

The `@specialized` -Annotation

If you feel adventerous, you can try the @specialized keyword (it's pretty buggy and crashes the compiler often). This makes scalac compile special versions of a class for all or selected primitive types. This only makes sense, if the type parameter appears plain in type signatures ( get(a: A) , but not get(as: Seq[A]) ) or as a type paramter to Array . I think you'll receive a warning if speicialization is pointless.

Answer 2

You can't really tell what you're measuring here--not very well, anyway--because the for loop isn't as fast as a pure while loop, and the inner operation is quite inexpensive. If we rewrite the code with while loops--the key double-iteration being

 var i = 0
  while (i<I) {
    var j = 0
    while (j<J) {
      flt(i * J + j) = v(i)(j)
      j += 1
    }
    i += 1
  }
  flt

then we see that the bytecode for the generic case is actually dramatically different. Non-generic:

133:    checkcast   #174; //class "[D"
136:    astore  6
138:    iconst_0
139:    istore  5
141:    iload   5
143:    iload_2
144:    if_icmpge   191
147:    iconst_0
148:    istore  4
150:    iload   4
152:    iload_3
153:    if_icmpge   182
// The stuff above implements the loop; now we do the real work
156:    aload   6
158:    iload   5
160:    iload_3
161:    imul
162:    iload   4
164:    iadd
165:    aload_1
166:    iload   5
168:    aaload             // v(i)
169:    iload   4
171:    daload             // v(i)(j)
172:    dastore            // flt(.) = _
173:    iload   4
175:    iconst_1
176:    iadd
177:    istore  4
// Okay, done with the inner work, time to jump around
179:    goto    150
182:    iload   5
184:    iconst_1
185:    iadd
186:    istore  5
188:    goto    141

It's just a bunch of jumps and low-level operations (daload and dastore being the key ones that load and store a double from an array). If we look at the key inner part of the generic bytecode, it instead looks like

160:    getstatic   #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
163:    aload   7
165:    iload   6
167:    iload   4
169:    imul
170:    iload   5
172:    iadd
173:    getstatic   #30; //Field scala/runtime/ScalaRunTime$.MODULE$:Lscala/runtime/ScalaRunTime$;
176:    aload_1
177:    iload   6
179:    aaload
180:    iload   5
182:    invokevirtual   #107; //Method scala/runtime/ScalaRunTime$.array_apply:(Ljava/lang/Object;I)Ljava/lang/Object;
185:    invokevirtual   #111; //Method scala/runtime/ScalaRunTime$.array_update:(Ljava/lang/Object;ILjava/lang/Object;)V
188:    iload   5
190:    iconst_1
191:    iadd
192:    istore  5

which, as you can see, has to call methods to do the array apply and update. The bytecode for that is a huge mess of stuff like

2:   aload_3 
3:   instanceof      #98; //class "[Ljava/lang/Object;"
6:   ifeq    18
9:   aload_3   
10:  checkcast       #98; //class "[Ljava/lang/Object;"
13:  iload_2
14:  aaload 
15:  goto    183
18:  aload_3
19:  instanceof      #100; //class "[I"
22:  ifeq    37
25:  aload_3   
26:  checkcast       #100; //class "[I"
29:  iload_2
30:  iaload 
31:  invokestatic    #106; //Method scala/runtime/BoxesRunTime.boxToInteger:
34:  goto    183
37:  aload_3
38:  instanceof      #108; //class "[D"
41:  ifeq    56
44:  aload_3   
45:  checkcast       #108; //class "[D"
48:  iload_2
49:  daload 
50:  invokestatic    #112; //Method scala/runtime/BoxesRunTime.boxToDouble:(
53:  goto    183

which basically has to test each type of array and box it if it's the type you're looking for. Double is pretty near the front (3rd of 10), but it's still a pretty major overhead, even if the JVM can recognize that the code ends up being box/unbox and therefore doesn't actually need to allocate memory. (I'm not sure it can do that, but even if it could it wouldn't solve the problem.)

So, what to do? You can try [@specialized T], which will expand your code tenfold for you, as if you wrote each primitive array operation by yourself. Specialization is buggy in 2.9 (should be less so in 2.10), though, so it may not work the way you hope. If speed is of the essence--well, first, write while loops instead of for loops (or at least compile with -optimise which helps for loops out by a factor of two or so!), and then consider either specialization or writing the code by hand for the types you require.

scala implicit performance

Question

2 answers

solution1
5 2012-07-09 14:43:08

Example

The `@specialized` -Annotation

solution2
4 ACCPTED 2012-07-09 15:53:29

scala implicit performance

Question

2 answers

solution1 5 2012-07-09 14:43:08

Example

The @specialized -Annotation

solution2 4 ACCPTED 2012-07-09 15:53:29

solution1
5 2012-07-09 14:43:08

The `@specialized` -Annotation

solution2
4 ACCPTED 2012-07-09 15:53:29