Cross-module optimization in GHC

Question

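I have a non-recursive function to calculate the longest common subsequence that seems to perform well (ghc 7.6.1, compiled with the -O2 -fllvm flags) if I measure it with Criterion in the same module. On the other hand, if I convert the function into a module, export just that function (as recommended here), and then measure again with Criterion, I get a ~2x slowdown (which goes away if I move the Criterion test back into the module where the function is defined). I tried marking the function with the INLINE pragma, which didn't make any difference in the cross-module performance measurements.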

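It seems to me that GHC might be doing a strictness analysis that works well when the function and main (from which the function is reachable) are in the same module, but not when they are split across modules. I would appreciate pointers on how to modularize the function so that it performs well when called from other modules. The code in question is too big to paste here; you can see it here if you want to try it out. A small example of what I am trying to do is below (with code snippets):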

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

Answer 1

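hammar is right: the important issue is that the compiler can see the type at which lcs is used at the same time as it can see the code, so it can specialise the code to that particular type.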

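If the compiler doesn't know the type at which the code is used, it has no choice but to produce polymorphic code. That is bad for performance, and I'm rather surprised the difference is only ~2x here. Polymorphic code means that many operations need a type-class lookup, and that at least makes it impossible to inline the looked-up function or to constant-fold sizes [e.g. for unboxed array/vector access].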
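To make the cost of the type-class lookup concrete, here is a minimal sketch that spells out the dictionary by hand. The names EqDict, countMatchesDict and countMatchesChar are hypothetical and are not part of the code in the question; the explicit record only approximates what GHC does internally for an Eq a constraint.

```haskell
module Main (main) where

-- Hypothetical hand-written stand-in for the dictionary GHC passes
-- around for an `Eq a` constraint.
newtype EqDict a = EqDict { eqD :: a -> a -> Bool }

-- Roughly what unspecialised polymorphic code amounts to: every
-- comparison is a call through a function stored in the dictionary,
-- so the comparison itself cannot be inlined here.
countMatchesDict :: EqDict a -> [a] -> [a] -> Int
countMatchesDict d xs ys = length (filter id (zipWith (eqD d) xs ys))

-- Roughly what a specialised version amounts to: the element type is
-- known, so (==) can be resolved to the Char comparison at compile time.
countMatchesChar :: String -> String -> Int
countMatchesChar xs ys = length (filter id (zipWith (==) xs ys))

main :: IO ()
main = do
  print (countMatchesDict (EqDict (==)) "abcdef" "abxdey")
  print (countMatchesChar "abcdef" "abxdey")
```

Both calls compute the same result; the difference is only in how much the optimiser knows about the comparison.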

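With the implementation and the use in separate modules, you can't get performance comparable to the one-module case unless the code that needs specialising is visible at the use site (or, if you know the needed types at the implementation site, you specialise there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).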
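As a sketch of specialising at the implementation site, the following uses a hypothetical polymorphic function commonPrefixLen (not from the question's code) and asks GHC for monomorphic copies at two element types chosen purely for illustration.

```haskell
module Main (main) where

-- Hypothetical polymorphic function standing in for the real LCS code.
commonPrefixLen :: Eq a => [a] -> [a] -> Int
commonPrefixLen xs ys = length (takeWhile id (zipWith (==) xs ys))

-- Ask GHC to also compile monomorphic copies for these types. Callers
-- in other modules that use exactly these types can then be rewritten
-- to call the specialised copies.
{-# SPECIALISE commonPrefixLen :: String -> String -> Int #-}
{-# SPECIALISE commonPrefixLen :: [Int] -> [Int] -> Int #-}

main :: IO ()
main = do
  print (commonPrefixLen "abcde" "abxyz")
  print (commonPrefixLen [1, 2, 3 :: Int] [1, 2, 3, 4])
```

This only helps for the types listed in the pragmas; any other element type still goes through the polymorphic version.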

Making the code visible at the use site is usually done by exposing the unfolding in the interface file, by marking the function {-# INLINABLE #-}.

"I tried marking the function with the INLINE pragma, which didn't make any difference in the cross-module performance measurements."

Marking only

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

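as INLINE or INLINABLE makes no difference, of course: that function is trivial, and the compiler exposes its unfolding anyway because it is so small. Even if its unfolding weren't exposed, the difference would not be measurable.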

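You need to expose the unfoldings of the functions doing the actual work too, at least those of the polymorphic ones: lcsh, findSnakes, gridWalk and cmp (cmp is the crucial one here, but the others are necessary to 1. see that cmp is needed, and 2. call the specialised cmp from them).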
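The shape of that fix can be sketched as follows. The names cmpElem, matchRun and commonPrefix are hypothetical stand-ins for cmp, the workers (lcsh, findSnakes, gridWalk) and lcs respectively, and plain lists replace unboxed vectors so the sketch depends only on base. In a real project these definitions would sit in the library module and main in the benchmark module.

```haskell
module Main (main) where

-- Plays the role of cmp: the place where the Eq dictionary is used.
cmpElem :: Eq a => a -> a -> Bool
cmpElem x y = x == y
{-# INLINABLE cmpElem #-}

-- Plays the role of the polymorphic workers. Without INLINABLE, an
-- importing module would only see the compiled polymorphic version
-- and could not specialise it (or cmpElem) to its element type.
matchRun :: Eq a => Int -> [a] -> [a] -> Int
matchRun n (x:xs) (y:ys)
  | cmpElem x y = matchRun (n + 1) xs ys
matchRun n _ _ = n
{-# INLINABLE matchRun #-}

-- Plays the role of lcs: a trivial wrapper, small enough that its
-- unfolding is exposed without any pragma.
commonPrefix :: Eq a => [a] -> [a] -> Int
commonPrefix = matchRun 0

main :: IO ()
main = print (commonPrefix "abcdefghij" "abcdxfghijk")
```

INLINABLE is used rather than INLINE because the worker is recursive: the pragma keeps the unfolding available for specialisation at the call site without asking GHC to inline it.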

With those made INLINABLE, the difference between the separate-module case

$ ./diffBench 
warming up
estimating clock resolution...
mean is 1.573571 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  2182 (0.7%) high severe
estimating cost of a clock call...
mean is 40.54233 ns (12 iterations)

benchmarking lcs 10
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
variance introduced by outliers: 26.787%
variance is moderately inflated by outliers

and the single-module case

$ ./oneModule 
warming up
estimating clock resolution...
mean is 1.726459 us (320001 iterations)
found 2092 outliers among 319999 samples (0.7%)
  1608 (0.5%) high severe
estimating cost of a clock call...
mean is 39.98567 ns (14 iterations)

benchmarking lcs 10
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
variance introduced by outliers: 26.791%
variance is moderately inflated by outliers

is small.

Solution 1 | 14 | Accepted | 2013-06-04 08:01:28
