
How to improve a push data pipeline in C# to match F# in performance

A recurring pet project of mine is implementing push-based data pipelines in F#. Push pipelines are simpler and faster than pull pipelines like LINQ (although they don't have all the capabilities of pull pipelines).

Something that has stumped me for a while is that I can't seem to implement a push pipeline in C# that is as efficient as my push pipelines in F#. I am looking for input on how to get my C# implementation closer to the F# one.

A simple push pipeline in F# can be represented like this:

type Receiver<'T> = 'T            -> unit
type Stream<'T>   = Receiver<'T>  -> unit

In C# we could write this:

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

The idea here is that a Stream<> is a function that given a receiver of values calls receiver with all values in the stream.
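To make the contract concrete, here is a minimal usage sketch (the `Demo` class and the value choices are mine, not from the original post): a stream that pushes three values, and a receiver that collects them.

```csharp
using System;
using System.Collections.Generic;

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

public static class Demo
{
    public static void Main()
    {
        // A Stream<int> is just a function: hand it a receiver
        // and it pushes every value in the stream into that receiver.
        Stream<int> s = r => { r(1); r(2); r(3); };

        var collected = new List<int>();
        s(v => collected.Add(v)); // the receiver is just a callback

        Console.WriteLine(string.Join(",", collected)); // prints "1,2,3"
    }
}
```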

This allows us to define map (aka `Select`) like this in F#:

let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> r (m v))

C#:

public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
  r => t(v => r(m(v)));

With a few more functions implemented in the same way, we can define a data pipeline that measures the overhead.

let trivialTest n =
  TrivialStream.range       0 1 n
  |> TrivialStream.map      int64
  |> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map      ((+) 1L)
  |> TrivialStream.sum

let trivialTestCs n =
  Stream
    .Range(0,1,n)
    .Map(fun v -> int64 v)
    .Filter(fun v -> v &&& 1L = 0L)
    .Map(fun v -> v + 1L)
    .Sum()

In this pipeline each operation is very cheap so any overhead from the underlying implementation should show up when we measure it.
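As a sanity check on the expected result (my calculation, not from the original post): for inner = 100 the surviving values are the even numbers 0, 2, ..., 100, i.e. 51 of them, each incremented by one, so the sum is 1 + 3 + ... + 101 = 51² = 2601, which matches the result=2601L rows below.

```csharp
using System;

public static class Check
{
    public static void Main()
    {
        // Imperative replay of the pipeline for n = 100:
        // keep even values, add one to each, sum.
        var sum = 0L;
        for (var i = 0; i <= 100; i++)
            if ((i & 1) == 0) sum += i + 1L;
        Console.WriteLine(sum); // prints 2601
    }
}
```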

When comparing four different data pipelines, imperative (not really a pipeline, but there to sanity-check the implementation), trivialpush, trivialpush(C#), and linq, these are the numbers on .NET 4.7.1/x64:

Running imperative with total=100000000, outer=1000000, inner=100 ...
  ... 87 ms, cc0=0, cc1=0, cc2=0, result=2601L
Running trivialpush with total=100000000, outer=1000000, inner=100 ...
  ... 414 ms, cc0=53, cc1=0, cc2=0, result=2601L
Running trivialpush(C#) with total=100000000, outer=1000000, inner=100 ...
  ... 1184 ms, cc0=322, cc1=0, cc2=0, result=2601L
Running linq with total=100000000, outer=1000000, inner=100 ...
  ... 2080 ms, cc0=157, cc1=0, cc2=0, result=2601L

The imperative solution is the fastest, and LINQ, being a pull data pipeline, is the slowest. This is expected.

What's not expected is that the F# push pipeline seems to have 3x less overhead than the C# pipeline, despite having a very similar implementation and being used in a similar way.

How do I change the C# data pipeline so that it matches or surpasses the F# data pipeline? I want the API of the data pipeline to stay roughly the same.

Update 2018-06-18

@scrwtp asked what happens if I remove inline in F#. I originally added inline in order to get sum to work as intended (in F#, inline enables more advanced generics):

Running imperative with total=100000000, outer=1000000, inner=100 ...
  ... 85 ms, cc0=0, cc1=0, cc2=0, result=2601L
Running trivialpush with total=100000000, outer=1000000, inner=100 ...
  ... 773 ms, cc0=106, cc1=0, cc2=0, result=2601L
Running trivialpush(C#) with total=100000000, outer=1000000, inner=100 ...
  ... 1181 ms, cc0=322, cc1=0, cc2=0, result=2601L
Running linq with total=100000000, outer=1000000, inner=100 ...
  ... 2124 ms, cc0=157, cc1=0, cc2=0, result=2601L

This slows down the F# version significantly, but it still performs 50% better than my C# stream library.

It's interesting to see that inline has such a profound impact on performance when the only thing being inlined is the code that builds up the callback pipeline. Once built, the callback pipeline should look exactly the same.

Update 2018-06-24

I decided to look in detail at the difference between the F# and C# data pipelines.

Here is how the jitted code for Filter(fun v -> v &&& 1L = 0L) looks in F#:

; TrivialPush, F#, filter operation
00007ffc`b7d01160 488bc2          mov     rax,rdx
; F# inlines the filter function: (fun v -> v &&& 1 = 0L)
; Is even?
00007ffc`b7d01163 a801            test    al,1
00007ffc`b7d01165 7512            jne     00007ffc`b7d01179
; Yes, call next chain in pipeline
; Load pointer next step in pipeline
00007ffc`b7d01167 488b4908        mov     rcx,qword ptr [rcx+8]
; Load Object Method Table
00007ffc`b7d0116b 488b01          mov     rax,qword ptr [rcx]
; Load Table of methods
00007ffc`b7d0116e 488b4040        mov     rax,qword ptr [rax+40h]
; Load address of Invoke
00007ffc`b7d01172 488b4020        mov     rax,qword ptr [rax+20h]
; Jump to Invoke (tail call)
00007ffc`b7d01176 48ffe0          jmp     rax
; No, the number was odd, bail out
00007ffc`b7d01179 33c0            xor     eax,eax
00007ffc`b7d0117b c3              ret

The only real complaint about this code is that the jitter failed to inline the tail call, so we end up with a virtual tail call.

Let's look at the same data pipeline in C#:

; TrivialPush, C#, filter operation
; Method prelude
00007ffc`b75c1a10 57              push    rdi
00007ffc`b75c1a11 56              push    rsi
; Allocate space on stack
00007ffc`b75c1a12 4883ec28        sub     rsp,28h
00007ffc`b75c1a16 488bf1          mov     rsi,rcx
00007ffc`b75c1a19 488bfa          mov     rdi,rdx
; Load pointer test delegate (fun v -> v &&& 1 = 0L)
00007ffc`b75c1a1c 488b4e10        mov     rcx,qword ptr [rsi+10h]
; Load Method Table
00007ffc`b75c1a20 488b4110        mov     rax,qword ptr [rcx+10h]
; Setup this pointer for delegate
00007ffc`b75c1a24 488d4808        lea     rcx,[rax+8]
00007ffc`b75c1a28 488b09          mov     rcx,qword ptr [rcx]
00007ffc`b75c1a2b 488bd7          mov     rdx,rdi
; Load address to Invoke and call
00007ffc`b75c1a2e ff5018          call    qword ptr [rax+18h]
; Did filter return true?
00007ffc`b75c1a31 84c0            test    al,al
00007ffc`b75c1a33 7411            je      00007ffc`b75c1a46
; Yes, call next step in data pipeline
; Load Method Table
00007ffc`b75c1a35 488b4608        mov     rax,qword ptr [rsi+8]
00007ffc`b75c1a39 488d4808        lea     rcx,[rax+8]
; Setup this pointer for delegate
00007ffc`b75c1a3d 488b09          mov     rcx,qword ptr [rcx]
00007ffc`b75c1a40 488bd7          mov     rdx,rdi
; Load address to Invoke and call
00007ffc`b75c1a43 ff5018          call    qword ptr [rax+18h]
; Method epilogue
00007ffc`b75c1a46 90              nop
00007ffc`b75c1a47 4883c428        add     rsp,28h
00007ffc`b75c1a4b 5e              pop     rsi
00007ffc`b75c1a4c 5f              pop     rdi
00007ffc`b75c1a4d c3              ret
; (fun v -> v &&& 1 = 0L) redirect
00007ffc`b75c0408 e963160000      jmp     00007ffc`b75c1a70
; (fun v -> v &&& 1 = 0L)
00007ffc`b75c1a70 488bc2          mov     rax,rdx
; Is even?
00007ffc`b75c1a73 a801            test    al,1
00007ffc`b75c1a75 0f94c0          sete    al
; return result
00007ffc`b75c1a78 0fb6c0          movzx   eax,al
; We are done!
00007ffc`b75c1a7b c3              ret

Compared to the F# data pipeline, it's easy to see that the code above is more expensive:

  1. F# inlined the test function, thus avoiding a virtual call (but why can't the jitter devirtualize this call and inline it for us?)
  2. F# uses tail calls, which in this case end up more efficient because we do a virtual jump rather than a virtual call to the next step
  3. There is less prelude/epilogue fiddling in the F# jitted code, perhaps because of the tail call?
  4. There is a redirect jump between steps in the pipeline in the C# jitted code.
  5. The C# code uses delegates rather than abstract classes. It seems that delegate invoke is slightly more efficient than abstract class invoke.

In 64-bit mode it seems the main performance benefits come from

  1. F# inlining the test lambda
  2. F# using tail calls (this is not true in 32-bit mode, where tail calls kill performance)

Note that the F# data pipeline steps aren't inlined; it's the pipeline build-up code that is inlined. That does, however, seem to give some performance benefit, perhaps because more information is available to the jitter?

In order to improve the performance of the C# pipeline, it seems I need to structure my C# code so that the jitter devirtualizes and inlines the calls. The jitter has these capabilities, so why don't they apply here?
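One way to hand the jitter that opportunity (a sketch of a known technique, not code from the original post; all names such as IReceiver, FilterEven, and StructStream are hypothetical) is to replace the delegate steps with struct receivers behind a generic interface. The CLR compiles a specialized copy of generic code for every value-type instantiation, so the calls to Receive resolve to concrete, non-virtual methods that the JIT can inline.

```csharp
using System;

// Hypothetical sketch: pipeline steps as structs behind a generic interface.
// Generic code over value types is specialized per instantiation, so the
// Receive calls are non-virtual and eligible for inlining by the JIT.
public interface IReceiver<in T>
{
    void Receive(T v);
}

public struct MapToInt64<TNext> : IReceiver<int> where TNext : IReceiver<long>
{
    public TNext Next;
    public void Receive(int v) => Next.Receive(v); // int widens to long
}

public struct FilterEven<TNext> : IReceiver<long> where TNext : IReceiver<long>
{
    public TNext Next;
    public void Receive(long v) { if ((v & 1L) == 0L) Next.Receive(v); }
}

public struct AddOne<TNext> : IReceiver<long> where TNext : IReceiver<long>
{
    public TNext Next;
    public void Receive(long v) => Next.Receive(v + 1L);
}

// The sink is a class so the accumulated sum survives struct copying.
public sealed class SumSink : IReceiver<long>
{
    public long Sum;
    public void Receive(long v) => Sum += v;
}

public static class StructStream
{
    public static void Range<TReceiver>(int b, int s, int e, TReceiver r)
        where TReceiver : IReceiver<int>
    {
        for (var i = b; i <= e; i += s) r.Receive(i);
    }

    public static void Main()
    {
        var sink = new SumSink();
        var pipeline = new MapToInt64<FilterEven<AddOne<SumSink>>>
        {
            Next = new FilterEven<AddOne<SumSink>>
            {
                Next = new AddOne<SumSink> { Next = sink }
            }
        };
        Range(0, 1, 100, pipeline);
        Console.WriteLine(sink.Sum); // prints 2601
    }
}
```

The trade-off is that the whole pipeline is spelled out in the generic type, so preserving the fluent Stream.Range(...).Map(...) API would need additional builder plumbing on top.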

Is there a way I can structure my F# code so that the tail calls can be devirtualized and inlined?

The full F# console program:

module TrivialStream =
  // A very simple push stream
  type Receiver<'T> = 'T            -> unit
  type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then r v)

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> r (m v))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss

module PerformanceTests =
  open System
  open System.Diagnostics
  open System.IO
  open System.Linq
  open TrivialStreams

  let now =
    let sw = Stopwatch ()
    sw.Start ()
    fun () -> sw.ElapsedMilliseconds

  let time n a =
    let inline cc i       = GC.CollectionCount i

    let v                 = a ()

    GC.Collect (2, GCCollectionMode.Forced, true)

    let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
    let b                 = now ()

    for i in 1..n do
      a () |> ignore

    let e = now ()
    let ecc0, ecc1, ecc2  = cc 0, cc 1, cc 2

    v, (e - b), ecc0 - bcc0, ecc1 - bcc1, ecc2 - bcc2

  let trivialTest n =
    TrivialStream.range       0 1 n
    |> TrivialStream.map      int64
    |> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
    |> TrivialStream.map      ((+) 1L)
    |> TrivialStream.sum

  let trivialTestCs n =
    Stream
      .Range(0,1,n)
      .Map(fun v -> int64 v)
      .Filter(fun v -> v &&& 1L = 0L)
      .Map(fun v -> v + 1L)
      .Sum()

  let linqTest n =
    Enumerable
      .Range(0, n + 1)
      .Select(fun v -> int64 v)
      .Where(fun v -> v &&& 1L = 0L)
      .Select(fun v -> v + 1L)
      .Sum()

  let imperativeTest n =
    let rec loop s i =
      if i >= 0L then
        if i &&& 1L = 0L then
          loop (s + i + 1L) (i - 1L)
        else
          loop s (i - 1L)
      else
        s
    loop 0L (int64 n)

  let test () =
    printfn "Running performance tests..."

    let testCases =
      [|
        "imperative"      , imperativeTest
        "trivialpush"     , trivialTest
        "trivialpush(C#)" , trivialTestCs
        "linq"            , linqTest
      |]

    do
      // Just in case tiered compilation is activated on dotnet core 2.1+
      let warmups = 100
      printfn "Warming up..."
      for name, a in testCases do
        time warmups (fun () -> a warmups) |> ignore

    let total   = 100000000
    let outers =
      [|
        10
        1000
        1000000
      |]
    for outer in outers do
      let inner = total / outer
      for name, a in testCases do
        printfn "Running %s with total=%d, outer=%d, inner=%d ..." name total outer inner
        let v, ms, cc0, cc1, cc2 = time outer (fun () -> a inner)
        printfn "  ... %d ms, cc0=%d, cc1=%d, cc2=%d, result=%A" ms cc0 cc1 cc2 v

    printfn "Performance tests completed"

[<EntryPoint>]
let main argv =
  PerformanceTests.test ()
  0

The full C# library:

namespace TrivialStreams
{
  using System;

  public delegate void Receiver<in T>(T v);
  public delegate void Stream<out T>(Receiver<T> r);

  public static class Stream
  {
    public static Stream<int> Range(int b, int s, int e) => 
      r =>
        {
          for(var i = b; i <= e; i += s)
          {
            r(i);
          }
        };

    public static Stream<T> Filter<T>(this Stream<T> t, Func<T, bool> f) =>
      r => t(v => 
        {
          if (f(v)) r(v);
        });

    public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
      r => t(v => r(m(v)));

    public static long Sum(this Stream<long> t)
    {
      var sum = 0L;

      t(v => sum += v);

      return sum;
    }
  }
}

The F# compiler will sometimes inline functions without explicit instructions to do so. You can annotate functions with [<MethodImpl(MethodImplOptions.NoInlining)>] to prevent this.

Updating your TrivialStream like this:

open System.Runtime.CompilerServices

[<MethodImpl(MethodImplOptions.NoInlining)>]
let range b s e : Stream<int> =
  fun r -> Loop.range s e r b

[<MethodImpl(MethodImplOptions.NoInlining)>]
let filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then r v)

[<MethodImpl(MethodImplOptions.NoInlining)>]
let map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> r (m v))

[<MethodImpl(MethodImplOptions.NoInlining)>]
let sum (s : Stream<'T>) : 'T =
  let mutable ss = LanguagePrimitives.GenericZero
  s (fun v -> ss <- ss + v)
  ss

and then updating the test itself like this:

open System.Runtime.CompilerServices

[<MethodImpl(MethodImplOptions.NoInlining)>]
let parseToInt64 = int64

[<MethodImpl(MethodImplOptions.NoInlining)>]
let filterImpl = fun v -> v &&& 1L = 0L

[<MethodImpl(MethodImplOptions.NoInlining)>]
let mapImpl = ((+) 1L)

let trivialTest n =

  TrivialStream.range       0 1 n
  |> TrivialStream.map      parseToInt64
  |> TrivialStream.filter   filterImpl
  |> TrivialStream.map      mapImpl
  |> TrivialStream.sum

When run as a 32-bit application, this results in an F# run that is actually slower than the C# version. There is some additional strange behavior going on with tail calls in the 32-bit version.

For the 64-bit version, these changes bring the F# and C# versions within 15% of each other.

If you swap the F# Receiver and Stream for the C# delegates (or just Action<'t> and Action<Action<'t>>), then the performance of the two is roughly equivalent, so I suspect that there are additional optimizations for FSharpFunc at play.

  open TrivialStreams
  // A very simple push stream
  //type Receiver<'T> = 'T            -> unit
  //type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    module Loop =
      let rec range s e (r:Receiver<'t> ) i = if i <= e then r.Invoke i; range s e r (i + s)

  open Details
  open System.Runtime.CompilerServices

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let range b s e =
    Stream<'t>(fun r -> (Loop.range s e r b))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let filter f (s : Stream<'T>) =
    Stream<'T>(fun r -> s.Invoke (fun v -> if f v then r.Invoke v))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let map m (s : Stream<'T>) =
    Stream<'U>(fun r -> s.Invoke (fun v -> r.Invoke (m v)))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s.Invoke (fun v -> ss <- ss + v)
    ss

You can apply a small portion of the F# compiler optimizations to the C# by annotating your methods with the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute, but this is only a marginal improvement.
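A sketch of that annotation applied to the question's combinators (names mirror the question's library; the Demo class is mine). Note that AggressiveInlining only affects the combinator call that builds the pipeline, not the per-element delegate invocations, which is consistent with the marginal improvement reported.

```csharp
using System;
using System.Runtime.CompilerServices;

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

public static class StreamOps
{
    // Hints the JIT to inline the combinator itself; the per-element
    // lambda invocations remain virtual delegate calls.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
        r => t(v => r(m(v)));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static long Sum(this Stream<long> t)
    {
        var sum = 0L;
        t(v => sum += v);
        return sum;
    }
}

public static class Demo
{
    public static void Main()
    {
        Stream<int> s = r => { r(1); r(2); r(3); };
        Console.WriteLine(s.Map(v => (long)v).Sum()); // prints 6
    }
}
```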
