How to improve a push data pipeline in C# to match F# in performance

Question

A recurring pet project for me is to implement push-based data pipelines in F#. Push pipelines are simpler and faster than pull pipelines like LINQ (although they don't have all the capabilities of pull pipelines).

Something that has stumped me for a while is that I don't seem to be able to implement a push pipeline in C# that is as efficient as my push pipelines in F#. I am looking for input on how to get my C# implementation closer to F#.

A simple push pipeline in F# can be represented like this:

type Receiver<'T> = 'T            -> unit
type Stream<'T>   = Receiver<'T>  -> unit

In C# we could write this:

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

The idea here is that a Stream<> is a function that, given a receiver of values, calls the receiver with every value in the stream.
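To make that definition concrete, here is a minimal C# sketch. The FromArray and Collect helpers and the Demo class are hypothetical names introduced only for illustration; they are not part of the library shown in this question.

```csharp
using System;
using System.Collections.Generic;

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

public static class Demo
{
    // Hypothetical helper: a stream that pushes each array element to the receiver.
    public static Stream<T> FromArray<T>(T[] values) =>
        r =>
        {
            foreach (var v in values) r(v);
        };

    // Hypothetical helper: runs the stream and collects everything it pushes.
    public static List<T> Collect<T>(Stream<T> s)
    {
        var result = new List<T>();
        s(v => result.Add(v));
        return result;
    }

    public static void Main()
    {
        var stream = FromArray(new[] { 1, 2, 3 });
        // Nothing is produced until the stream is handed a receiver.
        stream(v => Console.WriteLine(v));
    }
}
```

The stream itself holds no data; it is just a function that knows how to drive a receiver, which is why composing streams amounts to wrapping receivers.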

This allows us to define map (aka Select) like this in F#:

let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> r (m v))

C#:

public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
  r => t(v => r(m(v)));

We can implement other functions in the same way until we have enough to define a data pipeline that tests the overhead.

let trivialTest n =
  TrivialStream.range       0 1 n
  |> TrivialStream.map      int64
  |> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
  |> TrivialStream.map      ((+) 1L)
  |> TrivialStream.sum

let trivialTestCs n =
  Stream
    .Range(0,1,n)
    .Map(fun v -> int64 v)
    .Filter(fun v -> v &&& 1L = 0L)
    .Map(fun v -> v + 1L)
    .Sum()

In this pipeline each operation is very cheap, so any overhead from the underlying implementation should show up when we measure it.
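A measurement of this kind can be sketched in C# roughly as follows. It mirrors the shape of the time function in the full F# program further down (one initial run, a forced GC, then elapsed time plus gen-0/1/2 collection counts). The Bench, Measure and SumEvenPlusOne names are assumptions made for this sketch, and the workload is simply the imperative baseline.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs `action` n times; returns its result, elapsed ms, and GC counts per generation.
    public static (T Result, long Ms, int Gen0, int Gen1, int Gen2) Measure<T>(int n, Func<T> action)
    {
        var result = action();                        // first run, also keeps the result
        GC.Collect(2, GCCollectionMode.Forced, true); // start from a clean heap

        int b0 = GC.CollectionCount(0), b1 = GC.CollectionCount(1), b2 = GC.CollectionCount(2);
        var sw = Stopwatch.StartNew();

        for (var i = 0; i < n; ++i) action();

        sw.Stop();
        return (result,
                sw.ElapsedMilliseconds,
                GC.CollectionCount(0) - b0,
                GC.CollectionCount(1) - b1,
                GC.CollectionCount(2) - b2);
    }

    // Imperative baseline: sum of (v + 1) over the even v in 0..n.
    public static long SumEvenPlusOne(int n)
    {
        var sum = 0L;
        for (long v = 0; v <= n; ++v)
        {
            if ((v & 1L) == 0L) sum += v + 1L;
        }
        return sum;
    }

    public static void Main()
    {
        var m = Measure(1000000, () => SumEvenPlusOne(100));
        Console.WriteLine($"{m.Ms} ms, cc0={m.Gen0}, cc1={m.Gen1}, cc2={m.Gen2}, result={m.Result}");
    }
}
```

The collection counts matter here because each pipeline construction allocates closures, so a rising gen-0 count is a direct sign of pipeline overhead.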

When comparing four different data pipelines, imperative (not really a pipeline, but there to sanity check the implementation), trivialpush, trivialpush(C#) and linq, these are the numbers on .NET 4.7.1/x64:

Running imperative with total=100000000, outer=1000000, inner=100 ...
  ... 87 ms, cc0=0, cc1=0, cc2=0, result=2601L
Running trivialpush with total=100000000, outer=1000000, inner=100 ...
  ... 414 ms, cc0=53, cc1=0, cc2=0, result=2601L
Running trivialpush(C#) with total=100000000, outer=1000000, inner=100 ...
  ... 1184 ms, cc0=322, cc1=0, cc2=0, result=2601L
Running linq with total=100000000, outer=1000000, inner=100 ...
  ... 2080 ms, cc0=157, cc1=0, cc2=0, result=2601L

The imperative solution is the fastest and LINQ, being a pull data pipeline, is the slowest. This is expected.

What's not expected is that the F# push pipeline seems to have about 3x less overhead than the C# pipeline, despite having a very similar implementation and being used in a similar way.

How do I change the C# data pipeline so that it matches or surpasses the F# data pipeline? I want the API of the data pipeline to be roughly the same.

Update 2018-06-18

@scrwtp asked what happens if I remove inline in F#. I originally added inline in order to get sum to work as intended (in F#, inline allows more advanced generics). These are the numbers with inline removed:

Running imperative with total=100000000, outer=1000000, inner=100 ...
  ... 85 ms, cc0=0, cc1=0, cc2=0, result=2601L
Running trivialpush with total=100000000, outer=1000000, inner=100 ...
  ... 773 ms, cc0=106, cc1=0, cc2=0, result=2601L
Running trivialpush(C#) with total=100000000, outer=1000000, inner=100 ...
  ... 1181 ms, cc0=322, cc1=0, cc2=0, result=2601L
Running linq with total=100000000, outer=1000000, inner=100 ...
  ... 2124 ms, cc0=157, cc1=0, cc2=0, result=2601L

This slows down the F# version significantly, but it still performs about 50% better than my C# stream library.

It's interesting to see that inline has such a profound impact on performance when the only thing that is inlined is the code that builds up the callback pipeline. Once built, the callback pipeline should look exactly the same.

Update 2018-06-24

I decided to look in detail at the difference between the F# and C# data pipelines.

Here is how the jitted code for Filter(fun v -> v &&& 1L = 0L) looks for F#:

; TrivialPush, F#, filter operation
00007ffc`b7d01160 488bc2          mov     rax,rdx
; F# inlines the filter function: (fun v -> v &&& 1L = 0L)
; Is even?
00007ffc`b7d01163 a801            test    al,1
00007ffc`b7d01165 7512            jne     00007ffc`b7d01179
; Yes, call next chain in pipeline
; Load pointer next step in pipeline
00007ffc`b7d01167 488b4908        mov     rcx,qword ptr [rcx+8]
; Load Object Method Table
00007ffc`b7d0116b 488b01          mov     rax,qword ptr [rcx]
; Load Table of methods
00007ffc`b7d0116e 488b4040        mov     rax,qword ptr [rax+40h]
; Load address of Invoke
00007ffc`b7d01172 488b4020        mov     rax,qword ptr [rax+20h]
; Jump to Invoke (tail call)
00007ffc`b7d01176 48ffe0          jmp     rax
; No, the number was odd, bail out
00007ffc`b7d01179 33c0            xor     eax,eax
00007ffc`b7d0117b c3              ret

My only real complaint about this code is that the jitter failed to inline the tail call, so we end up with a virtual tail call.

Let's look at the same data pipeline in C#:

; TrivialPush, C#, filter operation
; Method prologue
00007ffc`b75c1a10 57              push    rdi
00007ffc`b75c1a11 56              push    rsi
; Allocate space on stack
00007ffc`b75c1a12 4883ec28        sub     rsp,28h
00007ffc`b75c1a16 488bf1          mov     rsi,rcx
00007ffc`b75c1a19 488bfa          mov     rdi,rdx
; Load pointer to the test delegate (fun v -> v &&& 1L = 0L)
00007ffc`b75c1a1c 488b4e10        mov     rcx,qword ptr [rsi+10h]
; Load Method Table
00007ffc`b75c1a20 488b4110        mov     rax,qword ptr [rcx+10h]
; Setup this pointer for delegate
00007ffc`b75c1a24 488d4808        lea     rcx,[rax+8]
00007ffc`b75c1a28 488b09          mov     rcx,qword ptr [rcx]
00007ffc`b75c1a2b 488bd7          mov     rdx,rdi
; Load address to Invoke and call
00007ffc`b75c1a2e ff5018          call    qword ptr [rax+18h]
; Did filter return true?
00007ffc`b75c1a31 84c0            test    al,al
00007ffc`b75c1a33 7411            je      00007ffc`b75c1a46
; Yes, call next step in data pipeline
; Load Method Table
00007ffc`b75c1a35 488b4608        mov     rax,qword ptr [rsi+8]
00007ffc`b75c1a39 488d4808        lea     rcx,[rax+8]
; Setup this pointer for delegate
00007ffc`b75c1a3d 488b09          mov     rcx,qword ptr [rcx]
00007ffc`b75c1a40 488bd7          mov     rdx,rdi
; Load address to Invoke and call
00007ffc`b75c1a43 ff5018          call    qword ptr [rax+18h]
; Method epilogue
00007ffc`b75c1a46 90              nop
00007ffc`b75c1a47 4883c428        add     rsp,28h
00007ffc`b75c1a4b 5e              pop     rsi
00007ffc`b75c1a4c 5f              pop     rdi
00007ffc`b75c1a4d c3              ret
; (fun v -> v &&& 1L = 0L) redirect
00007ffc`b75c0408 e963160000      jmp     00007ffc`b75c1a70
; (fun v -> v &&& 1L = 0L)
00007ffc`b75c1a70 488bc2          mov     rax,rdx
; Is even?
00007ffc`b75c1a73 a801            test    al,1
00007ffc`b75c1a75 0f94c0          sete    al
; return result
00007ffc`b75c1a78 0fb6c0          movzx   eax,al
; We are done!
00007ffc`b75c1a7b c3              ret

Compared to the F# data pipeline, it's easy to see that the code above is more expensive:

F# inlined the test function, thus avoiding a virtual call (but why can't the jitter devirtualize this call and inline it for us?).
F# uses tail calls, which in this case end up more efficient because we just do a virtual jump rather than a virtual call to the next step.
There is less prologue/epilogue fiddling in the F# jitted code, maybe because of the tail call?
There is a redirect jump between steps in the pipeline for the C# jitted code.
The C# code uses delegates rather than abstract classes. It seems that delegate invoke is slightly more efficient than abstract class invoke.

In 64-bit mode it seems the main performance benefits come from:

F# inlining the test lambda.
F# using tail calls (this is not true in 32-bit mode, where tail calls kill performance).

We see that the F# data pipeline steps aren't inlined; it's the code that builds up the data pipeline that is inlined. That does, however, seem to give some performance benefit, perhaps because more information is available to the jitter?

In order to improve the performance of the C# pipeline, it seems that I need to structure my C# code so that the jitter devirtualizes and inlines the calls. The jitter has these capabilities, but why aren't they applied?

Is there a way I can structure my F# code so that the tail calls can be devirtualized and inlined?
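One way the C# side might be restructured so that the jitter has a chance to devirtualize and inline is to replace delegates with struct receivers behind generic constraints, since the .NET JIT generates specialized code for each value-type instantiation of a generic. This is a different technique from anything in the question or the answer, it gives up the lambda-friendly API, and whether the calls actually get inlined on a given runtime is an assumption that would need to be checked in the jitted code. All type names below (IReceiver, SumReceiver, AddOne, KeepEven, ToInt64, StructPipeline) are hypothetical.

```csharp
using System;

public interface IReceiver<T> { void Receive(T v); }

// Terminal step: accumulates the sum.
public struct SumReceiver : IReceiver<long>
{
    public long Sum;
    public void Receive(long v) { Sum += v; }
}

// Map step: v -> v + 1, then forwards to the next step.
public struct AddOne<TNext> : IReceiver<long> where TNext : struct, IReceiver<long>
{
    public TNext Next;
    public void Receive(long v) { Next.Receive(v + 1L); }
}

// Filter step: forwards only even values.
public struct KeepEven<TNext> : IReceiver<long> where TNext : struct, IReceiver<long>
{
    public TNext Next;
    public void Receive(long v) { if ((v & 1L) == 0L) Next.Receive(v); }
}

// Map step: int -> long, then forwards.
public struct ToInt64<TNext> : IReceiver<int> where TNext : struct, IReceiver<long>
{
    public TNext Next;
    public void Receive(int v) { Next.Receive((long)v); }
}

public static class StructPipeline
{
    // The receiver is passed by ref so that state mutated in nested structs is kept.
    public static void Range<TReceiver>(int b, int s, int e, ref TReceiver r)
        where TReceiver : struct, IReceiver<int>
    {
        for (var i = b; i <= e; i += s) r.Receive(i);
    }

    // Same pipeline as trivialTest: range, map int64, filter even, map (+1), sum.
    public static long TrivialTest(int n)
    {
        var pipeline = new ToInt64<KeepEven<AddOne<SumReceiver>>>();
        Range(0, 1, n, ref pipeline);
        return pipeline.Next.Next.Next.Sum;
    }

    public static void Main()
    {
        Console.WriteLine(TrivialTest(100)); // prints 2601
    }
}
```

Because every step's concrete type is known statically, no call in this sketch goes through a delegate or a virtual slot. The cost is that each operation needs its own struct type instead of an arbitrary lambda.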

The full F# console program:

module TrivialStream =
  // A very simple push stream
  type Receiver<'T> = 'T            -> unit
  type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    module Loop =
      let rec range s e r i = if i <= e then r i; range s e r (i + s)

  open Details

  let inline range b s e : Stream<int> =
    fun r -> Loop.range s e r b

  let inline filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
    fun r -> s (fun v -> if f v then r v)

  let inline map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
    fun r -> s (fun v -> r (m v))

  let inline sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s (fun v -> ss <- ss + v)
    ss

module PerformanceTests =
  open System
  open System.Diagnostics
  open System.IO
  open System.Linq
  open TrivialStreams

  let now =
    let sw = Stopwatch ()
    sw.Start ()
    fun () -> sw.ElapsedMilliseconds

  let time n a =
    let inline cc i       = GC.CollectionCount i

    let v                 = a ()

    GC.Collect (2, GCCollectionMode.Forced, true)

    let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
    let b                 = now ()

    for i in 1..n do
      a () |> ignore

    let e = now ()
    let ecc0, ecc1, ecc2  = cc 0, cc 1, cc 2

    v, (e - b), ecc0 - bcc0, ecc1 - bcc1, ecc2 - bcc2

  let trivialTest n =
    TrivialStream.range       0 1 n
    |> TrivialStream.map      int64
    |> TrivialStream.filter   (fun v -> v &&& 1L = 0L)
    |> TrivialStream.map      ((+) 1L)
    |> TrivialStream.sum

  let trivialTestCs n =
    Stream
      .Range(0,1,n)
      .Map(fun v -> int64 v)
      .Filter(fun v -> v &&& 1L = 0L)
      .Map(fun v -> v + 1L)
      .Sum()

  let linqTest n =
    Enumerable
      .Range(0, n + 1)
      .Select(fun v -> int64 v)
      .Where(fun v -> v &&& 1L = 0L)
      .Select(fun v -> v + 1L)
      .Sum()

  let imperativeTest n =
    let rec loop s i =
      if i >= 0L then
        if i &&& 1L = 0L then
          loop (s + i + 1L) (i - 1L)
        else
          loop s (i - 1L)
      else
        s
    loop 0L (int64 n)

  let test () =
    printfn "Running performance tests..."

    let testCases =
      [|
        "imperative"      , imperativeTest
        "trivialpush"     , trivialTest
        "trivialpush(C#)" , trivialTestCs
        "linq"            , linqTest
      |]

    do
      // Just in case tiered compilation is activated on dotnet core 2.1+
      let warmups = 100
      printfn "Warming up..."
      for name, a in testCases do
        time warmups (fun () -> a warmups) |> ignore

    let total   = 100000000
    let outers =
      [|
        10
        1000
        1000000
      |]
    for outer in outers do
      let inner = total / outer
      for name, a in testCases do
        printfn "Running %s with total=%d, outer=%d, inner=%d ..." name total outer inner
        let v, ms, cc0, cc1, cc2 = time outer (fun () -> a inner)
        printfn "  ... %d ms, cc0=%d, cc1=%d, cc2=%d, result=%A" ms cc0 cc1 cc2 v

    printfn "Performance tests completed"

[<EntryPoint>]
let main argv =
  PerformanceTests.test ()
  0

The full C# library:

namespace TrivialStreams
{
  using System;

  public delegate void Receiver<in T>(T v);
  public delegate void Stream<out T>(Receiver<T> r);

  public static class Stream
  {
    public static Stream<int> Range(int b, int s, int e) => 
      r =>
        {
          // Start at b (the original started at 0 and ignored the b parameter)
          for(var i = b; i <= e; i += s)
          {
            r(i);
          }
        };

    public static Stream<T> Filter<T>(this Stream<T> t, Func<T, bool> f) =>
      r => t(v => 
        {
          if (f(v)) r(v);
        });

    public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
      r => t(v => r(m(v)));

    public static long Sum(this Stream<long> t)
    {
      var sum = 0L;

      t(v => sum += v);

      return sum;
    }
  }
}

Answer 1

The F# compiler will sometimes inline functions without explicit instructions to do so. You can annotate functions with [<MethodImpl(MethodImplOptions.NoInlining)>] to prevent this.

Updating your TrivialStream like this:

open System.Runtime.CompilerServices

[<MethodImpl(MethodImplOptions.NoInlining)>]
let range b s e : Stream<int> =
  fun r -> Loop.range s e r b

[<MethodImpl(MethodImplOptions.NoInlining)>]
let filter (f : 'T -> bool) (s : Stream<'T>) : Stream<'T> =
  fun r -> s (fun v -> if f v then r v)

[<MethodImpl(MethodImplOptions.NoInlining)>]
let map (m : 'T -> 'U) (s : Stream<'T>) : Stream<'U> =
  fun r -> s (fun v -> r (m v))

[<MethodImpl(MethodImplOptions.NoInlining)>]
let sum (s : Stream<'T>) : 'T =
  let mutable ss = LanguagePrimitives.GenericZero
  s (fun v -> ss <- ss + v)
  ss

and then updating the test itself like this:

open System.Runtime.CompilerServices

[<MethodImpl(MethodImplOptions.NoInlining)>]
let parseToInt64 = int64

[<MethodImpl(MethodImplOptions.NoInlining)>]
let filterImpl = fun v -> v &&& 1L = 0L

[<MethodImpl(MethodImplOptions.NoInlining)>]
let mapImpl = ((+) 1L)

let trivialTest n =

  TrivialStream.range       0 1 n
  |> TrivialStream.map      parseToInt64
  |> TrivialStream.filter   filterImpl
  |> TrivialStream.map      mapImpl
  |> TrivialStream.sum

When run as a 32-bit application, this results in an F# run which is actually slower than the C# version. There is some additional strange behavior going on with tail calls in the 32-bit version.

For the 64-bit version, these changes bring the F# and C# versions within 15% of each other.

If you swap the F# Receiver and Stream for the C# delegates (or just Action<'t> and Action<Action<'t>>), then the performance of the two is roughly equivalent, so I suspect that there are additional optimizations around FSharpFunc at play.

  open TrivialStreams
  // A very simple push stream
  //type Receiver<'T> = 'T            -> unit
  //type Stream<'T>   = Receiver<'T>  -> unit

  module Details =
    module Loop =
      let rec range s e (r:Receiver<'t> ) i = if i <= e then r.Invoke i; range s e r (i + s)

  open Details
  open System.Runtime.CompilerServices

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let range b s e =
    Stream<'t>(fun r -> (Loop.range s e r b))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let filter f (s : Stream<'T>) =
    Stream<'T>(fun r -> s.Invoke (fun v -> if f v then r.Invoke v))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let map m (s : Stream<'T>) =
    Stream<'U>(fun r -> s.Invoke (fun v -> r.Invoke (m v)))

  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let sum (s : Stream<'T>) : 'T =
    let mutable ss = LanguagePrimitives.GenericZero
    s.Invoke (fun v -> ss <- ss + v)
    ss

You can apply a small portion of the F# compiler optimizations to the C# code by annotating your methods with the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute, but this is only a marginal improvement.
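As a sketch of what that annotation looks like on the C# library from the question (the InlinedStream class name and the Main method are added here for illustration only):

```csharp
using System;
using System.Runtime.CompilerServices;

public delegate void Receiver<in T>(T v);
public delegate void Stream<out T>(Receiver<T> r);

public static class InlinedStream
{
    // The attribute is only a hint to the JIT, and it applies to these
    // pipeline-building methods, not to the lambdas they create.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Stream<int> Range(int b, int s, int e) =>
        r =>
        {
            for (var i = b; i <= e; i += s) r(i);
        };

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Stream<T> Filter<T>(this Stream<T> t, Func<T, bool> f) =>
        r => t(v =>
        {
            if (f(v)) r(v);
        });

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Stream<U> Map<T, U>(this Stream<T> t, Func<T, U> m) =>
        r => t(v => r(m(v)));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static long Sum(this Stream<long> t)
    {
        var sum = 0L;
        t(v => sum += v);
        return sum;
    }

    public static void Main()
    {
        var result = Range(0, 1, 100)
            .Map(v => (long)v)
            .Filter(v => (v & 1L) == 0L)
            .Map(v => v + 1L)
            .Sum();
        Console.WriteLine(result); // prints 2601
    }
}
```

The per-value work still goes through delegate invocations, which is consistent with the improvement being marginal.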

Solution 1
6 2018-06-20 03:28:45