

Performance of iterating over Array vs List

Inspired by this question, I wanted to see if there were any performance differences between iterating over an Array vs a List.

Since we would be iterating over the entire collection, my initial thought was that there shouldn't really be a performance difference between the two. Furthermore, I thought that using a tail-recursive function to do a count should be as fast as just using a mutable variable. However, when I wrote a simple script to test the difference, I found the following (run in Release mode with VS2015):

add_k_list, elapsed 15804 ms, result 0L
add_k_list_mutable, elapsed 12800 ms, result 0L
add_k_array, elapsed 15719 ms, result 0L

I wonder why the list implementation that uses a mutable variable is noticeably faster than both the tail-recursive version and the version that uses a mutable variable and an array.

Here's my code:

open System.Diagnostics

let d = 100000
let n = 100000

let stopWatch = 
  let sw = Stopwatch ()
  sw.Start ()
  sw

let testList = [1..d]
let testArray = [|1..d|]

// Runs a over testList n+1 times and reports the elapsed time and the first result.
let timeIt (name : string) (a : int -> int list -> 'T) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let v = a 0 testList
  for i = 1 to n do
    a i testList |> ignore
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, result %A" name d v

// Same as timeIt, but runs a over testArray instead.
let timeItArr (name : string) (a : int -> int [] -> 'T) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let v = a 0 testArray
  for i = 1 to n do
    a i testArray |> ignore
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, result %A" name d v

// Counts the k in k_range for which x ^^^ k falls outside [k, d],
// using a tail-recursive helper with an accumulator.
let add_k_list x (k_range: int list) =
    let rec add k_range x acc =
        match k_range with
        | [] -> acc
        | k::ks ->
            let y = x ^^^ k
            if y < k || y > d then
                add ks x (acc + 1L)
            else
                add ks x acc
    add k_range x 0L

// Same count, but with a mutable accumulator and a for loop over the list.
let add_k_list_mutable x (k_range: int list) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if y < k || y > d then
            count <- count + 1L
    count

// Same count again, but iterating over an array instead of a list.
let add_k_array x (k_range: int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if y < k || y > d then
            count <- count + 1L
    count

[<EntryPoint>]
let main argv = 
    let x = 5
    timeItArr "add_k_array" add_k_array
    timeIt "add_k_list" add_k_list
    timeIt "add_k_list_mutable" add_k_list_mutable
    printfn "%A" argv
    0 // return an integer exit code

EDIT: The above test was run in 32-bit Release mode with VS2015. At the suggestion of s952163, I reran it as 64-bit and found that the results differ quite a bit:

add_k_list, elapsed 17918 ms, result 0L
add_k_list_mutable, elapsed 17898 ms, result 0L
add_k_array, elapsed 8261 ms, result 0L

I'm especially surprised that the difference between using tail recursion with an accumulator and using a mutable variable seems to have disappeared.

When running a slightly modified program (posted below), these are the numbers I received:

x64 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1296 ms, accumulated result 495000099L
add_k_list, elapsed 2675 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 2678 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 869 ms, accumulated result 499624318L
add_k_list, elapsed 2486 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 2483 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 750 ms, accumulated result 507000943L
add_k_list, elapsed 1602 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1603 ms, accumulated result 507000943L

x86 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1601 ms, accumulated result 495000099L
add_k_list, elapsed 2014 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 1835 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 1495 ms, accumulated result 499624318L
add_k_list, elapsed 1714 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 1595 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 1363 ms, accumulated result 507000943L
add_k_list, elapsed 1406 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1221 ms, accumulated result 507000943L

(As usual, it's important not to run with the debugger attached, as that changes how the JIT compiler works. With the debugger attached, the JIT produces code that is easier to debug but also slower.)

The way this works is that the total number of iterations is kept constant while the count of the outer loop and the size of the list/array are varied.

For me the only odd measurement is that the array loop is worse in some cases than the list loop.

If the total amount of work is the same, why do we see different results when outer/inner is varied?

The answer is most likely related to the CPU cache. When we iterate over an array of 10,000,000 elements, its actual size in memory is 40,000,000 bytes. My machine has "just" 6,000,000 bytes of L3 cache. When the array has 1,000,000 elements, its size is 4,000,000 bytes, which fits in L3.

The list type in F# is essentially a singly linked list, and a rough estimate of the size of one list element is 4 (data) + 8 (64-bit pointer) + 8 (vtable pointer) + 4 (heap overhead) = 24 bytes. With this estimate, a list with 10,000,000 elements is about 240,000,000 bytes, and one with 1,000,000 elements is about 24,000,000 bytes. Neither fits in the L3 cache on my machine.

When the number of elements is 100,000, the array is about 400,000 bytes and the list about 2,400,000 bytes. Both fit snugly into the L3 cache.
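
As a quick sanity check, here is a small sketch of these back-of-the-envelope estimates (the 4 bytes per array element and ~24 bytes per x64 list cell are the rough assumptions above, not measured values):

let arrayBytes (elements : int) = int64 elements * 4L   // assumed 4 bytes per int array element
let listBytes  (elements : int) = int64 elements * 24L  // assumed ~24 bytes per x64 list cell
let l3CacheBytes = 6000000L                             // L3 size of the machine used above

for n in [100000; 1000000; 10000000] do
    printfn "n = %8d: array ~%10d B (fits L3: %b), list ~%10d B (fits L3: %b)"
        n (arrayBytes n) (arrayBytes n <= l3CacheBytes)
          (listBytes n)  (listBytes n  <= l3CacheBytes)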

This reasoning can explain the difference in performance between smaller arrays/lists and bigger ones.

If the elements of the list are not allocated sequentially (i.e. the heap is fragmented or the GC has moved them around), the performance of the list is expected to be much worse when it doesn't fit into the cache, because the CPU's prefetch strategy no longer works. The elements of an array are guaranteed to be sequential, so prefetching works fine as long as you iterate sequentially.

Why is tail recursion slower than the mutable for loop?

This actually isn't true in F# 3, where the for loop is expected to be much slower than tail recursion.

For a hint of the answer I used ILSpy to look at the generated IL code.

I found that FSharpList<>::get_TailOrNull() is called twice per iteration when using tail recursion: once to check whether we have reached the end, and once to get the next element (a redundant call).

The for loop version only calls FSharpList<>::get_TailOrNull() once per iteration.
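
To make the one-call-vs-two-calls difference concrete, here is a rough sketch of the two loop shapes. It is an approximation, not the actual compiled code: FSharpList's TailOrNull is internal, so the sketch uses a hypothetical Node type that mirrors the relevant part of its layout.

// A hypothetical minimal cons list mirroring FSharpList's layout: the end of the
// list is marked by a node whose TailOrNull is null.
[<AllowNullLiteral>]
type Node (head : int, tail : Node) =
    member __.Head = head
    member __.TailOrNull = tail

// Shape of the compiled tail-recursive match: the tail is read once to test for
// the end of the list and once more to fetch the next node (the redundant call).
let rec addRecSketch upper x acc (node : Node) =
    if isNull node.TailOrNull then acc                 // read 1: end-of-list test
    else
        let next = node.TailOrNull                     // read 2: fetch the same tail again
        let k = node.Head
        let y = x ^^^ k
        addRecSketch upper x (if y < k || y > upper then acc + 1L else acc) next

// Shape of the compiled for loop: the tail is read once per iteration, and that
// single value serves both as the loop condition and as the next node to visit.
let addLoopSketch upper x (node : Node) =
    let mutable count = 0L
    let mutable cur   = node
    let mutable next  = cur.TailOrNull                 // single tail read per iteration
    while not (isNull next) do
        let k = cur.Head
        let y = x ^^^ k
        if y < k || y > upper then count <- count + 1L
        cur  <- next
        next <- cur.TailOrNull
    count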

This extra call likely explains why tail recursion is slower, but as you noted, in x64 mode both list versions were about as fast. What's going on?

I checked the JIT-compiled assembly code and noted that the x64 JIT eliminates the extra call to FSharpList<>::get_TailOrNull(). The x86 JIT fails to eliminate the call.

Lastly, why is the array version slower than the list version on x86?

In general I expect arrays to have the least overhead of all collections in .NET. The reason is that an array is compact and sequential, and there are special instructions in ILAsm to access its elements.

So it's surprising to me that lists perform better in some cases.

Checking the assembly code again, it seems to come from the fact that the array version requires an extra variable to do its work, and the x86 CPU has few registers available, leading to an extra read from the stack per iteration. x64 has significantly more registers, so the array version only has to read from memory once per iteration, whereas the list version reads twice (head and tail).

Any conclusions?

  • When it comes to CPU performance, x64 is the way to go (this hasn't always been the case)
  • Expect arrays to perform better than any other data structure in .NET for operations where array access is O(1) (inserts are obviously slow)
  • The devil is in the details, meaning that in order to gain true insight we might need to check the assembly code
  • Cache locality is very important for large collections. Since arrays are compact and guaranteed to be sequential, they are often a good choice.
  • It's very difficult to predict performance, so always measure
  • Iterate towards zero when possible if you are really hungry for performance. This can save one read from memory (a small sketch follows below).
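
For the last bullet, here is a minimal sketch of one way to iterate towards zero in the array version. The counting logic is the same as add_k_array, with the bound passed in as a parameter l as in the modified program below; add_k_array_down is a name introduced here for illustration only.

let add_k_array_down x l (k_range : int []) =
    let mutable count = 0L
    // Count down towards zero so the loop condition compares against 0 instead of
    // re-reading an upper bound; per the point above, this can save one read.
    let mutable i = k_range.Length - 1
    while i >= 0 do
        let k = k_range.[i]
        let y = x ^^^ k
        if y < k || y > l then
            count <- count + 1L
        i <- i - 1
    count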

EDIT: The OP wondered why it seemed that x86 lists performed better than x64 lists.

I reran the perf tests with the list/array size set to 1,000. This makes sure the entire data structure fits into my L1 cache (256kB).

x64 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1062 ms, accumulated result 999499999L
add_k_list, elapsed 1134 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1110 ms, accumulated result 999499999L

x86 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1617 ms, accumulated result 999499999L
add_k_list, elapsed 1359 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1100 ms, accumulated result 999499999L

We see that for this size, x64 seems to perform about as well as or better than x86. Why do we see the opposite in the other measurements? I speculate that this is because the list elements are larger in the x64 version, meaning we use more bandwidth moving data from L3 to L1.

So, if you can, try to make sure your data fits into the L1 cache.

Final musings

When working with these sorts of questions I sometimes wonder if the whole von Neumann architecture is a big mistake. Instead we should have a data-flow architecture, as data is slow and instructions are fast.

AFAIK, under the hood CPUs have a data-flow architecture. The assembly language, though, looks like what one would expect from a von Neumann architecture, so in some sense it's a high-level abstraction over the data-flow architecture. But in order to produce reasonably performant code, the CPU die is mostly occupied by cache (~95%). With a pure data-flow architecture one would expect a higher percentage of the CPU die to do actual work.

Hope this was interesting; my modified program follows:

open System.Diagnostics

let stopWatch =
  let sw = Stopwatch ()
  sw.Start ()
  sw

let timeIt (name : string) (outer : int) (a : int -> int64) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let mutable acc = a 0
  for i = 2 to outer do
    acc <- acc + a i
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, accumulated result %A" name d acc

let add_k_list x l (k_range: int list) =
    let rec add k_range x acc =
        match k_range with
        | [] -> acc
        | k::ks -> let y = x ^^^ k
                   if (y < k || y > l) then
                       add ks x (acc + 1L)
                   else
                       add ks x acc
    add k_range x 0L


let add_k_list_mutable x l (k_range: int list) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count

let add_k_array x l (k_range: int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count

[<EntryPoint>]
let main argv =
  let total = 1000000000
  let outers = [|100; 1000; 10000|]

  for outer in outers do
    let inner = total / outer
    printfn "TestRun: Total: %d, Outer: %d, Inner: %d" total outer inner

    ignore <| System.GC.WaitForFullGCComplete ()

    let testList  = [1..inner]
    let testArray = [|1..inner|]

    timeIt    "add_k_array"         outer <| fun x -> add_k_array         x inner testArray
    timeIt    "add_k_list"          outer <| fun x -> add_k_list          x inner testList
    timeIt    "add_k_list_mutable"  outer <| fun x -> add_k_list_mutable  x inner testList

  0
