
Cross module optimizations in GHC

I have a non-recursive function to calculate the longest common subsequence that seems to perform well (ghc 7.6.1, compiled with -O2 -fllvm flags) if I measure it with Criterion in the same module. On the other hand, if I move the function into its own module, export just that function (as recommended here), and then measure again with Criterion, I get a ~2x slowdown (which goes away if I move the Criterion test back to the module where the function is defined). I tried marking the function with the INLINE pragma, which didn't make any difference in cross-module performance measurements.

It seems to me that GHC might be doing a strictness analysis that works well when the function and main (from which the function is reachable) are in the same module, but not when they are split across modules. I would appreciate pointers on how to modularize the function so that it performs well when called from other modules. The code in question is too big to paste here - you can see it here if you want to try it out. A small example of what I am trying to do is below (with snippets of code):

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

hammar is right, the important issue is that the compiler can see the type at which lcs is used at the same time as it can see the code, so it can specialise the code to that particular type.

If the compiler doesn't know the type at which the code will be used, it can only produce polymorphic code. And that is bad for performance - I'm rather surprised it's only a ~2× difference here. Polymorphic code means that for many operations a type-class lookup is needed, and that at the very least makes it impossible to inline the looked-up function or to constant-fold sizes [e.g. for unboxed array/vector access].
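To make the cost of a type-class lookup concrete, here is a minimal sketch that spells the lookup out by hand. `EqDict`, `memberPoly` and `memberChar` are hypothetical names invented for this illustration; they are not part of the lcs code, and the record is only a rough stand-in for the dictionary GHC passes for an `Eq a` constraint.

```haskell
module Main (main) where

-- Hand-written stand-in (an assumption for illustration) for the
-- dictionary that accompanies an `Eq a` constraint at run time.
newtype EqDict a = EqDict { eqD :: a -> a -> Bool }

-- Polymorphic version: every comparison is fetched from the dictionary
-- argument, so the comparison is an unknown function here.
memberPoly :: EqDict a -> a -> [a] -> Bool
memberPoly d x = any (eqD d x)

-- Specialised version: the comparison on Char is known statically,
-- so the compiler is free to inline it.
memberChar :: Char -> String -> Bool
memberChar x = any (== x)

main :: IO ()
main = do
  print (memberPoly (EqDict (==)) 'c' "abc")
  print (memberChar 'z' "abc")
```

Specialisation amounts to turning code of the first shape into code of the second shape for one concrete type.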

With implementation and use in separate modules, you cannot obtain performance comparable to the one-module case without making the code that needs specialising visible at the use site (or, if you know the needed types at the implementation site, specialising there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).
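As a small sketch of specialising at the implementation site, assuming the needed element type is known to be Char: `countEqual` is a hypothetical function made up for this example, not one of the functions from the question.

```haskell
module Main (main) where

-- Hypothetical polymorphic function (illustration only): counts the
-- positions at which two lists hold equal elements.
countEqual :: Eq a => [a] -> [a] -> Int
countEqual xs ys = length (filter id (zipWith (==) xs ys))

-- Ask GHC to also compile a copy fixed at Char, so that callers using
-- Char can get the dictionary-free version even from another module.
{-# SPECIALISE countEqual :: [Char] -> [Char] -> Int #-}

main :: IO ()
main = print (countEqual "abcdef" "abxdey")
```

The pragma only helps for the types listed in it; uses at any other type still go through the polymorphic code.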

Making the code visible at the use site is usually done by exposing the unfolding in the interface file, by marking the function {-# INLINABLE #-}.

"I tried marking the function with INLINE pragma which didn't make any difference in cross-module performance measurements."

Marking only

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

INLINE or INLINABLE doesn't make a difference, of course: that function is trivial, and the compiler exposes its unfolding anyway, since it's so small. Even if its unfolding wasn't exposed, the difference would not be measurable.

You need to expose the unfoldings of the functions doing the actual work too, at least those of the polymorphic ones, lcsh, findSnakes, gridWalk and cmp (cmp is the one that's crucial here, but the others are necessary to 1. see that cmp is needed, 2. call the specialised cmp from them).
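The shape of that fix, sketched on a toy example rather than on the real lcs code: `commonPrefix` and `prefixWorker` are hypothetical names, and everything is kept in one file only so the sketch runs on its own. In the setup from the question, the two functions would live in a library module that exports just the wrapper, and main would live in the benchmark module.

```haskell
module Main (main) where

-- Hypothetical exported wrapper, playing the role of lcs: it is tiny,
-- so its unfolding is exposed regardless of any pragma.
commonPrefix :: Eq a => [a] -> [a] -> Int
commonPrefix xs ys
  | length xs > length ys = prefixWorker ys xs 0
  | otherwise             = prefixWorker xs ys 0

-- Hypothetical worker, playing the role of lcsh and friends: this is
-- where the Eq dictionary is actually used, so this is the unfolding
-- that has to be exposed for specialisation at the call site.
{-# INLINABLE prefixWorker #-}
prefixWorker :: Eq a => [a] -> [a] -> Int -> Int
prefixWorker (x:xs) (y:ys) n
  | x == y = prefixWorker xs ys (n + 1)
prefixWorker _ _ n = n

main :: IO ()
main = print (commonPrefix "abcdefghij" "abcxefghijk")
```

Marking only the wrapper would leave the worker compiled polymorphically, which is the situation described in the question.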

Making those INLINABLE, the difference between the separate-module case

$ ./diffBench 
warming up
estimating clock resolution...
mean is 1.573571 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  2182 (0.7%) high severe
estimating cost of a clock call...
mean is 40.54233 ns (12 iterations)

benchmarking lcs 10
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
variance introduced by outliers: 26.787%
variance is moderately inflated by outliers

and the single-module case

$ ./oneModule 
warming up
estimating clock resolution...
mean is 1.726459 us (320001 iterations)
found 2092 outliers among 319999 samples (0.7%)
  1608 (0.5%) high severe
estimating cost of a clock call...
mean is 39.98567 ns (14 iterations)

benchmarking lcs 10
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
variance introduced by outliers: 26.791%
variance is moderately inflated by outliers

is bearably small.
