Cross-module optimization in GHC

Question

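I have a non-recursive function to compute the longest common subsequence that seems to perform well (ghc 7.6.1, compiled with the -O2 -fllvm flags) if I measure it with Criterion in the same module. On the other hand, if I convert the function into a module, export just that function (as recommended here), and then measure again with Criterion, I get a ~2x slowdown (which goes away if I move the Criterion test back into the module where the function is defined). I tried marking the function with an INLINE pragma, which made no difference to the cross-module performance measurements.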

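It seems to me that GHC might be doing a strictness analysis that works well when the function and main (from which the function is reachable) are in the same module, but not when they are split across modules. I would appreciate pointers on how to modularize the function so that it performs well when called from other modules. The code in question is too large to paste here - you can see it here if you want to try it out. A small example of what I am trying to do is below (with code snippets):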

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

Answer 1

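hammar is right: the important point is that the compiler can see the type at which lcs is used while it can also see the code, so it can specialise the code to that particular type.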

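If the compiler doesn't know the type at which the code will be used, it can only generate polymorphic code. That is bad for performance - I'm rather surprised the difference is only ~2x here. Polymorphic code means that many operations need a type-class lookup, which at the very least makes it impossible to inline the looked-up function or to constant-fold sizes [e.g. for unboxed array/vector access].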
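To make the cost of polymorphic code concrete, here is a minimal sketch that spells out the dictionary passing by hand. The names EqDict, countEqPoly and countEqChar are hypothetical and exist only for this illustration; GHC's real dictionary representation differs in detail, but the idea is the same: the polymorphic version calls its comparison through a value received at run time, while the specialised version has a statically known comparison that can be inlined.

```haskell
module Main (main) where

-- Hypothetical stand-in for the Eq dictionary that GHC passes implicitly
-- to a function with an (Eq a) constraint.
newtype EqDict a = EqDict { eqD :: a -> a -> Bool }

-- Polymorphic shape: every comparison is an indirect call through the
-- dictionary, so the comparison itself cannot be inlined here.
countEqPoly :: EqDict a -> [a] -> [a] -> Int
countEqPoly d xs ys = length (filter id (zipWith (eqD d) xs ys))

-- Specialised shape: the element type is fixed to Char, so (==) is known
-- at compile time and can be inlined to a primitive comparison.
countEqChar :: String -> String -> Int
countEqChar xs ys = length (filter id (zipWith (==) xs ys))

main :: IO ()
main = do
  print (countEqPoly (EqDict (==)) "abcde" "abxde")  -- prints 4
  print (countEqChar "abcde" "abxde")                -- prints 4
```

Both functions compute the same result; only the way the comparison is reached differs.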

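You cannot get performance comparable to the single-module case with implementation and use in separate modules unless the code that needs specialising is visible at the use site (or, if you know the needed types at the implementation site, unless you specialise there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).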
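As a sketch of specialising at the implementation site, the following assumes a hypothetical polymorphic function commonPrefixLen (not part of the code in the question) whose required element types are known in advance. Each SPECIALISE pragma asks GHC to also compile a copy at the given type, which callers at that type can use without needing the unfolding.

```haskell
module Main (main) where

-- Hypothetical polymorphic function, used only for illustration.
-- It counts how many leading elements two lists have in common.
commonPrefixLen :: Eq a => [a] -> [a] -> Int
commonPrefixLen xs ys = length (takeWhile id (zipWith (==) xs ys))
-- Request dedicated copies at the types we expect callers to use.
{-# SPECIALISE commonPrefixLen :: [Char] -> [Char] -> Int #-}
{-# SPECIALISE commonPrefixLen :: [Int] -> [Int] -> Int #-}

main :: IO ()
main = do
  print (commonPrefixLen "hello" "help")              -- prints 3
  print (commonPrefixLen [1, 2, 3 :: Int] [1, 2, 4])  -- prints 2
```

In a real project the function and its pragmas would live in the library module, and the calls in main would be in the importing module. The drawback is that the list of types has to be known where the function is defined.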

Making the code visible at the use site is usually done by exposing the unfolding in the interface file, by marking the function {-# INLINABLE #-}.

"I tried marking the function with an INLINE pragma, which made no difference to the cross-module performance measurements."

Marking only

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

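INLINE or INLINABLE makes no difference, of course: that function is trivial, and the compiler exposes its unfolding anyway because it is so small. Even if its unfolding weren't exposed, the difference would not be measurable.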

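You need to expose the unfoldings of the functions that do the actual work too, at least those of the polymorphic ones: lcsh, findSnakes, gridWalk and cmp (cmp is the crucial one here, but the others are necessary 1. to see that cmp is needed, and 2. to call the specialised cmp from them).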
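The structure described above can be sketched in miniature. The functions sharedPrefix, go and same are hypothetical stand-ins for the trivial wrapper (lcs), a polymorphic worker (such as lcsh) and the comparison helper (cmp); they are not the code from the question. The point is that the pragmas go on the worker and the helper, not just on the wrapper.

```haskell
module Main (main) where

-- Trivial wrapper (plays the role of lcs). It is small enough that GHC
-- would normally expose its unfolding without any pragma.
sharedPrefix :: Eq a => [a] -> [a] -> Int
sharedPrefix xs ys
  | length xs > length ys = go ys xs
  | otherwise             = go xs ys

-- Polymorphic worker (plays the role of lcsh and friends). INLINABLE
-- keeps its unfolding in the interface file so that an importing module
-- can specialise it, even though it is recursive.
go :: Eq a => [a] -> [a] -> Int
go (x:xs) (y:ys) | same x y = 1 + go xs ys
go _ _ = 0
{-# INLINABLE go #-}

-- Comparison helper (plays the role of cmp), the hot operation that
-- benefits most from being specialised to the concrete element type.
same :: Eq a => a -> a -> Bool
same = (==)
{-# INLINABLE same #-}

main :: IO ()
main = print (sharedPrefix "abcdefghij" "abcxefghijk")  -- prints 3
```

Everything sits in one module here only so that the sketch runs on its own; the pragmas matter once sharedPrefix, go and same live in a library module and the call at a concrete type lives in another.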

With lcsh, findSnakes, gridWalk and cmp marked INLINABLE, the difference between the separate-module case

$ ./diffBench 
warming up
estimating clock resolution...
mean is 1.573571 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  2182 (0.7%) high severe
estimating cost of a clock call...
mean is 40.54233 ns (12 iterations)

benchmarking lcs 10
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
variance introduced by outliers: 26.787%
variance is moderately inflated by outliers

and the single-module case

$ ./oneModule 
warming up
estimating clock resolution...
mean is 1.726459 us (320001 iterations)
found 2092 outliers among 319999 samples (0.7%)
  1608 (0.5%) high severe
estimating cost of a clock call...
mean is 39.98567 ns (14 iterations)

benchmarking lcs 10
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
variance introduced by outliers: 26.791%
variance is moderately inflated by outliers

is small.

Solution 1
14 Accepted 2013-06-04 08:01:28
