
Is there an optimization similar to loop unrolling for functional programming?

Disclaimer: I know little about the GHC compilation pipeline, but I hope to learn more about it with this post, for example whether comparing imperative vs. functional code is even relevant at the compilation level.

As you know, loop unrolling reduces the number of iterations of a loop by duplicating the code inside it. This improves performance since it reduces the number of jumps (and the penalties associated with them) and, AFAIR, creates bigger blocks of code, leaving room for better register-renaming optimization.

I was wondering: could there be an equivalent to loop unrolling for functional programming? Could we 'unroll' a function, i.e. open/expand its definition, first to reduce the number of calls to said function and/or to create bigger chunks of code, and then leave room for more code-rewriting optimizations (like register renaming, or some FP equivalent)?

Something that would 'unroll' or 'expand' a function definition, using for example function evaluation (maybe mixed with some tactic), in order to trade space for time.

An example of what I had in mind:

map1 _ [] = []
map1 f (x:xs) = (f x): map1 f xs

Would unroll to

map2 _ [] = []
map2 f (x:x1:xs) = (f x):(f x1):map2 f xs
map2 f (x:xs) = (f x): map2 f xs

Once more:

map4 _ [] = []
map4 f (x:x1:x2:x3:xs) = (f x):(f x1):(f x2):(f x3):map4 f xs
map4 f (x:x1:x2:xs) = (f x):(f x1):(f x2):map4 f xs
map4 f (x:x1:xs) = (f x):(f x1):map4 f xs
map4 f (x:xs) = (f x): map4 f xs

Two things are at play here: the extra cases of map4 (and the extra pattern tests on the list) could degrade performance, or the reduced number of calls to map4 could improve it. Maybe this could also reduce some of the constant overhead introduced by lazy evaluation?

Well, that doesn't seem too hard to write a test for, so after setting criterion loose on it, this is what I got:

ImgUr album

Problem size    map         map2        map4
5*10^6          105.4 ms    93.34 ms    89.79 ms
1*10^7          216.3 ms    186.8 ms    180.1 ms
5*10^7          1050 ms     913.7 ms    899.8 ms

Well, it seems that unrolling had some effect^1! map4 appears to be 16% faster.
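
For reference, the harness was roughly of this shape (a minimal sketch rather than the exact code I ran; map2 and map4 are the definitions from above, and nf forces the whole result list):

import Criterion.Main

-- A sketch of the benchmark setup: (+1) over Ints stands in for the
-- benchmarked function; map2 and map4 are as defined above.
main :: IO ()
main = defaultMain
  [ bgroup ("n = " ++ show n)
      [ bench "map"  $ nf (map  (+1)) xs
      , bench "map2" $ nf (map2 (+1)) xs
      , bench "map4" $ nf (map4 (+1)) xs
      ]
  | n <- [5*10^6, 1*10^7, 5*10^7 :: Int]
  , let xs = [1..n]
  ]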

Time for the questions then:

  1. Has this been discussed before? Is something like this already implemented?
  2. Is it really the reduced number of evaluations of map4 that improves speed?
  3. Can this be automated?
  4. Could I evaluate by chunks? I.e.: if (f x) is fully evaluated, fully evaluate everything up to (f x4).
  5. Do any other forms of this sort of unrolling come into play?
  6. How much inflation in function size could this lead to?
  7. Any shortcomings on why this is not a good idea?

1: I've also unrolled fib, since this sort of optimization would also apply in that form, but the performance gain there is really just compensating for a (very) bad algorithm.

Did you compile with optimizations? For me, with -O2, there's not really a difference between these snippets: map1, map2, and map4 ran in 279, 267, and 285 ms, respectively (and for comparison, map itself ran in 278 ms). So that just looks like measurement noise rather than an improvement to me.

That said, you might like to look at this GHC plugin which seems to be about loop unrolling.

It's sad but pretty true that pure functional languages and imperative languages tend to have very different optimization techniques. For example, you might want to look at stream fusion and deforestation -- two techniques that are pretty neat but don't translate very well to imperative languages.
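
To give a flavour of how that works in GHC, fusion-style optimizations are driven by compile-time rewrite rules. The classic textbook example is the map/map rule below (a sketch with an illustrative module name; the real rules in base are phrased in terms of build/foldr):

module MapFusion where  -- illustrative module, not from any library

-- Rewrite two list traversals into a single pass at compile time.
{-# RULES
"map/map" forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}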

And as for "Any short-commings on why this is not a good idea?", well, I can think of one right off the bat: 至于“为什么这不是一个好主意的任何短暂的交易?”,好吧,我可以想到一个正确的方法:

*Main> map succ (1:undefined)
[2*** Exception: Prelude.undefined
*Main> map4 succ (1:undefined)
*** Exception: Prelude.undefined

In many situations, making a function more strict in order to improve performance is fine, but here the performance win isn't that clear to me and map is often used in laziness-reliant ways.
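
For what it's worth, the strictness change is a consequence of how the unrolling above was written, not of unrolling as such. Here is a sketch of a variant (the name is made up) that keeps map's laziness by emitting each element before looking further down the spine:

map2' :: (a -> b) -> [a] -> [b]
map2' _ []     = []
map2' f (x:xs) = f x : case xs of        -- the head is produced before
                   []     -> []          -- the rest of the spine is forced
                   (y:ys) -> f y : map2' f ys

With this shape, head (map2' succ (1:undefined)) is 2, just like with map, while there are still two applications of f per recursive call.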

Along with the GHC unrolling plugin already mentioned, there is a page on the GHC trac which discusses peeling/unrolling. The "Open Issues" and "References" sections are particularly good sources of further research material.

Loop unrolling is quite a blunt weapon. I'd never want your map example to be unrolled, for example. It is entirely dominated by the memory allocation of the returned list and the thunks in its cells. It doesn't matter whether or not the register allocator has more to chew on. (Whether or not to unroll a fold like foldl' is perhaps a different question.)
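
To illustrate that last remark, here is a hand-unrolled strict left fold (a sketch, with a made-up name): each step is a strict accumulator update rather than a list-cell allocation, which is the kind of tight loop where unrolling is more plausibly worthwhile:

{-# LANGUAGE BangPatterns #-}

foldl2' :: (b -> a -> b) -> b -> [a] -> b
foldl2' f !z (x:y:xs) = foldl2' f (f (f z x) y) xs  -- two steps per call
foldl2' f !z [x]      = f z x
foldl2' _ !z []       = z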

GHC could achieve loop unrolling by inlining recursive functions. But it tries hard not to: in fact it will never inline the "loop-breaker(s)" of a recursive group, because otherwise there is no guarantee that inlining terminates at all. See Section 4 of "Secrets of the GHC inliner".
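
One way to express unrolling by hand while staying on the right side of that restriction (a sketch with made-up names, assuming GHC then inlines the non-recursive step): keep the per-element step non-recursive, so that only the outer definition is the loop-breaker and the step can be inlined as many times as you have spelled it out:

-- Non-recursive step: a candidate for inlining.
mapStep :: (a -> b) -> ((a -> b) -> [a] -> [b]) -> [a] -> [b]
mapStep _ _   []     = []
mapStep f rec (x:xs) = f x : rec f xs
{-# INLINE mapStep #-}

-- Two copies of the step per recursive call, analogous to map2;
-- mapUnrolled2 itself is the only loop-breaker.
mapUnrolled2 :: (a -> b) -> [a] -> [b]
mapUnrolled2 f = mapStep f (\f' -> mapStep f' (\f'' -> mapUnrolled2 f''))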

GHC does apply a limited form of loop peeling (or rather, partial redundancy elimination) in its LiberateCase pass (run with -O2):

f :: (Int, Int) -> Int
f p = g [0..snd p]
  where
    g :: [Int] -> Int
    g [] = 0
    g (x:xs) = fst p + x + g xs

Here, GHC will peel one iteration of the loop to get the fst p projection out of the loop and reference an unboxed Int# instead. The Core:

Lib.$wf
  = \ (ww_sNF :: Int) (ww1_sNJ :: GHC.Prim.Int#) ->
      case GHC.Enum.eftInt 0# ww1_sNJ of {
        [] -> 0#;
        : x_atE xs_atF ->
          case ww_sNF of { GHC.Types.I# x1_aLW ->
          case x_atE of { GHC.Types.I# y_aLZ ->
          letrec {
            $wg_sNB [InlPrag=NOUSERINLINE[2], Occ=LoopBreaker]
              :: [Int] -> GHC.Prim.Int#
            [LclId, Arity=1, Str=<S,1*U>, Unf=OtherCon []]
            $wg_sNB
              = \ (w_sNx :: [Int]) ->
                  case w_sNx of {
                    [] -> 0#;
                    : x2_Xud xs1_Xuf ->
                      case x2_Xud of { GHC.Types.I# y1_XMG ->
                      case $wg_sNB xs1_Xuf of ww2_sNA { __DEFAULT ->
                      GHC.Prim.+# (GHC.Prim.+# x1_aLW y1_XMG) ww2_sNA
                      }
                      }
                  }; } in
          case $wg_sNB xs_atF of ww2_sNA { __DEFAULT ->
          GHC.Prim.+# (GHC.Prim.+# x1_aLW y_aLZ) ww2_sNA
          }
          }
          }
      }
