简体   繁体   English

用于列表处理的Haskell优化受到Lazy Evaluation的阻碍

[英]Haskell Optimizations for List Processing stymied by Lazy Evaluation

I'm trying to improve the efficiency of the following code. 我正在努力提高以下代码的效率。 I want to count all occurrences of a symbol before a given point (as part of pattern-matching using a Burrows-Wheeler transform). 我想在给定点之前计算符号的所有出现次数(作为使用Burrows-Wheeler变换的模式匹配的一部分)。 There's some overlap in how I'm counting symbols. 我如何计算符号有一些重叠。 However, when I have tried to implement what looks like it should be more efficient code, it turns out to be less efficient, and I'm assuming that lazy evaluation and my poor understanding of it is to blame. 然而,当我试图实现看起来应该是更有效的代码时,结果效率会降低,而且我假设懒惰的评估和我对它的不良理解是责任。

My first attempt at a counting function went like this: 我对计数功能的第一次尝试是这样的:

count :: Ord a => [a] -> a -> Int -> Int
count list sym pos = length . filter (== sym) . take pos $ list

Then in the body of the matching function itself: 然后在匹配函数本身的主体中:

matching str refCol pattern = match 0 (n - 1) (reverse pattern)
  where n = length str
        refFstOcc sym = length $ takeWhile (/= sym) refCol
        match top bottom [] = bottom - top + 1
        match top bottom (sym : syms) =
          let topCt = count str sym top
              bottomCt = count str sym (bottom + 1)
              middleCt = bottomCt - topCt
              refCt = refFstOcc sym
          in if middleCt > 0
               then match (refCt + topCt) (refCt + bottomCt - 1) syms
               else 0

(Stripped down for brevity - I'm memoizing first occurrences of symbols in refCol through a Map, and a couple other details as well). (为简洁起见 - 我通过Map记录refCol中第一次出现的符号,以及其他一些细节)。

Edit : Sample use would be: 编辑 :示例使用将是:

matching "AT$TCTAGT" "$AACGTTTT" "TCG"

which should be 1 (assuming I didn't mistype anything). 应该是1(假设我没有输错任何东西)。

Now, I'm recounting everything in the middle between the top pointer and the bottom twice, which adds up when I count a million character DNA string with only 4 possible choices for characters (and profiling tells me that this is the big bottleneck, too, taking 48% of my time for bottomCt and around 38% of my time for topCt). 现在,我在top指针和bottom之间的中间重述两次,当我计算一百万个字符的DNA字符串时,只有4个可能的字符选择(并且分析告诉我这也是一个很大的瓶颈)。 ,我花了48%的时间用于bottomCt,大约38%的时间用于topCt)。 For reference, when calculating this for a million character string and trying to match 50 patterns (each of which is between 1 and 1000 characters), the program takes about 8.5 to 9.5 seconds to run. 作为参考,当计算一百万个字符串并尝试匹配50个模式(每个模式在1到1000个字符之间)时,程序运行大约需要8.5到9.5秒。

However, if I try to implement the following function: 但是,如果我尝试实现以下功能:

countBetween :: Ord a => [a] -> a -> Int -> Int -> (Int, Int)
countBetween list sym top bottom =
  let (topList, bottomList) = splitAt top list
      midList = take (bottom - top) bottomList
      getSyms = length . filter (== sym)
  in (getSyms topList, getSyms midList)

(with changes made to the matching function to compensate), the program takes between 18 and 22 seconds to run. (对匹配功能进行更改以进行补偿),程序运行需要18到22秒。

I've also tried passing in a Map which can keep track of previous calls, but that also takes about 20 seconds to run and runs up the memory usage. 我也试过传入一个可以跟踪之前调用的Map,但是这也需要大约20秒才能运行并运行内存使用量。

Similarly, I've shorted length . filter (== sym) 同样,我缩短了length . filter (== sym) length . filter (== sym) to a fold , but again - 20 seconds for foldr , and 14-15 for foldl . length . filter (== sym)fold ,但是再次 - 对于foldr为20秒,对于foldl为14-15。

So what would be a proper Haskell way to optimize this code through rewriting it? 那么通过重写来优化这段代码的Haskell方法是什么呢? (Specifically, I'm looking for something that doesn't involve precomputation - I may not be reusing strings very much - and which explains something of why this is happening). (具体来说,我正在寻找一些不涉及预计算的东西 - 我可能不会重复使用字符串 - 这解释了为什么会发生这种情况)。

Edit : More clearly, what I am looking for is the following: 编辑 :更清楚的是,我正在寻找的是以下内容:

a) Why does this behaviour happen in Haskell? a)为什么在Haskell中会发生这种行为? How does lazy evaluation play a role, what optimizations is the compiler making to rewrite the count and countBetween functions, and what other factors may be involved? 惰性求值如何发挥作用,编译器对重写countcountBetween函数进行了哪些优化,以及可能涉及的其他因素是什么?

b) What is a simple code rewrite which would address this issue so that I don't traverse the lists multiple times? b)什么是简单的代码重写,它将解决这个问题,以便我不会多次遍历列表? I'm looking specifically for something which addresses that issue, rather than a solution which sidesteps it. 我正在寻找能够解决这个问题的东西,而不是一个可以避开它的解决方案。 If the final answer is, count is the most efficient possible way to write the code, why is that? 如果最终答案是, count是编写代码的最有效方法,为什么呢?

It seems that the main point of the match routine is to transform a interval (bottom,top) to another interval based on the current symbol sym . 似乎match例程的要点是基于当前符号sym将间隔(bottom,top)变换为另一个间隔。 The formulas are basically: 公式基本上是:

ref_fst = index of sym in ref_col
  -- defined in an outer scope

match :: Char -> (Int,Int) -> (Int,Int)
match sym (bottom, top) | bottom > top =  (bottom, top) -- if the empty interval
match sym (bottom, top) =
  let 
    top_count = count of sym in str from index 0 to top
    bot_count = count of sym in str from index 0 to bottom
    mid_count = top_count - bot_count
  in if mid_count > 0
         then (ref_fst + bot_count, ref_fst + top_count)
         else (1,0)  -- the empty interval

And then matching is just a fold over pattern using match with the initial interval (0, n-1) . 然后matching是刚刚超过一个折叠pattern使用match与初始间隔(0, n-1)

Both top_count and bot_count can be computed efficiently using a precomputed lookup table, and below is code which does that. top_countbot_count都可以使用预先计算的查找表进行有效计算,下面是执行此操作的代码。

If you run test1 you'll see a trace of how the interval is transformed via each symbol in the pattern. 如果运行test1您将看到如何通过模式中的每个符号转换间隔。

Note: There may be off-by-1 errors, and I've hard coded ref_fst to be 0 - I'm not sure how this fits into the larger algorithm, but the basic idea should be sound. 注意:可能有1个错误的错误,并且我将ref_fst硬编码为0 - 我不确定这是如何适应更大的算法,但基本的想法应该是合理的。

Note that once the counts vector has been created there is no need to index into the original string anymore. 请注意,一旦创建了counts向量,就不再需要索引原始字符串了。 Therefore, even though I use a ByteString here for the (larger) DNA sequence, it's not crucial, and the mkCounts routine should work just as well if passed a String instead. 因此,即使我在这里使用ByteString(较大的)DNA序列,它也不是至关重要的,如果传递一个String, mkCounts例程应该也能正常工作。

Code also available at http://lpaste.net/174288 代码也可在http://lpaste.net/174288获得

{-# LANGUAGE OverloadedStrings #-}

import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UVM
import qualified Data.ByteString.Char8 as BS
import Debug.Trace
import Text.Printf
import Data.List

mkCounts :: BS.ByteString -> UV.Vector (Int,Int,Int,Int)
mkCounts syms = UV.create $ do
  let n = BS.length syms
  v <- UVM.new (n+1)
  let loop x i | i >= n = return x
      loop x i = let s = BS.index syms i
                     (a,t,c,g) = x
                     x' = case s of
                            'A' -> (a+1,t,c,g)
                            'T' -> (a,t+1,c,g)
                            'C' -> (a,t,c+1,g)
                            'G' -> (a,t,c,g+1)
                            _   -> x
                 in do UVM.write v i x
                       loop x' (i+1) 
  x <- loop (0,0,0,0) 0
  UVM.write v n x
  return v

data DNA = A | C | T | G
  deriving (Show)

getter :: DNA -> (Int,Int,Int,Int) -> Int
getter A (a,_,_,_) = a
getter T (_,t,_,_) = t
getter C (_,_,c,_) = c
getter G (_,_,_,g) = g

-- narrow a window
narrow :: Int -> UV.Vector (Int,Int,Int,Int) -> DNA -> (Int,Int) ->  (Int,Int)

narrow refcol counts sym (lo,hi) | trace msg False = undefined
  where msg = printf "-- lo: %d  hi: %d  refcol: %d  sym: %s  top_cnt: %d  bot_count: %d" lo hi refcol (show sym) top_count bot_count
        top_count = getter sym (counts ! (hi+1))
        bot_count = getter sym (counts ! lo)

narrow refcol counts sym (lo,hi) =
  let top_count = getter sym (counts ! (hi+1))
      bot_count = getter sym (counts ! (lo+0))
      mid_count = top_count - bot_count
  in if mid_count > 0
       then ( refcol + bot_count, refcol + top_count-1 )
       else (lo+1,lo)  -- signal an wmpty window

findFirst :: DNA -> UV.Vector (Int,Int,Int,Int)  -> Int
findFirst sym v =
  let n = UV.length v
      loop i | i >= n = n
      loop i = if getter sym (v ! i) > 0
                 then i
                 else loop (i+1)
  in loop 0

toDNA :: String -> [DNA]
toDNA str = map charToDNA str

charToDNA :: Char -> DNA
charToDNA = go
  where go 'A' = A
        go 'C' = C
        go 'T' = T
        go 'G' = G

dnaToChar A = 'A'
dnaToChar C = 'C'
dnaToChar T = 'T'
dnaToChar G = 'G'

first :: DNA -> BS.ByteString -> Int
first sym str = maybe len id (BS.elemIndex (dnaToChar sym) str)
  where len = BS.length str

test2 = do
 -- matching "AT$TCTAGT" "$AACGTTTT" "TCG"
  let str    = "AT$TCTAGT"
      refcol = "$AACGTTTT"
      syms   = toDNA "TCG"

      -- hard coded for now
      -- may be computeed an memoized
      refcol_G = 4
      refcol_C = 3
      refcol_T = 5

      counts = mkCounts str
      w0 = (0, BS.length str -1)

      w1 = narrow refcol_G counts G w0
      w2 = narrow refcol_C counts C w1
      w3 = narrow refcol_T counts T w2

      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)

  putStrLn $ "firsts: " ++ show firsts

  putStrLn $ "w0: " ++ show w0
  putStrLn $ "w1: " ++ show w1
  putStrLn $ "w2: " ++ show w2
  putStrLn $ "w3: " ++ show w3
  let (lo,hi) = w3
      len = if lo <= hi then hi - lo + 1 else 0
  putStrLn $ "length: " ++ show len

matching :: BS.ByteString -> BS.ByteString -> String -> Int
matching  str refcol pattern = 
  let counts = mkCounts str
      n = BS.length str
      syms = toDNA (reverse pattern)
      firsts = (first A refcol, first T refcol, first C refcol, first G refcol)

      go (lo,hi) sym = narrow refcol counts sym (lo,hi)
        where refcol = getter sym firsts

      (lo, hi) = foldl' go (0,n-1) syms
      len = if lo <= hi then hi - lo + 1 else 0
  in len

test3 = matching "AT$TCTAGT" "$AACGTTTT" "TCG"

I'm not sure lazy evaluation has much to do with the performance of the code. 我不确定懒惰的评估与代码的性能有多大关系。 I think the main problem is the use of String - which is a linked list - instead of more performant string type. 我认为主要问题是使用String - 这是一个链表 - 而不是更高性能的字符串类型。

Note that this call in your countBetween function: 请注意,您的countBetween函数中的此调用:

  let (topList, bottomList) = splitAt top list

will re-create the linked link corresponding to topList meaning a lot more allocations. 将重新创建与topList相对应的链接链接, topList意味着更多的分配。

A Criterion benchmark to compare splitAt versus using take n/drop n may be found here: http://lpaste.net/174526 . 可以在此处找到用于比较splitAt与使用take n/drop n的Criterion基准: http//lpaste.net/174526 The splitAt version is about 3 times slower and, of course, has a lot more allocations. splitAt版本慢了大约3倍,当然,还有更多的分配。

Even if you don't want to "pre-compute" the counts you can improve matters a great deal by simply switching to either ByteString or Text. 即使您不想“预先计算”计数,您也可以通过简单地切换到ByteString或Text来改善问题。

Define: 限定:

countSyms :: Char -> ByteString -> Int -> Int -> Int
countSyms sym str lo hi =
  length [ i | i <- [lo..hi], BS.index str i == sym ]

and then: 然后:

countBetween :: ByteString -> Char -> Int -> Int -> (Int,Int)
countBetween str sym top bottom = (a,b)
  where a = countSyms sym str 0 (top-1)
        b = countSyms sym str top (bottom-1)

Also, don't use reverse on large lists - it will reallocate the entire list. 此外,不要在大型​​列表上使用reverse - 它将重新分配整个列表。 Just index into a ByteString / Text in reverse. 只需反向索引ByteString / Text即可。

Memoizing counts may or may not help. 记忆计数可能有所帮助,也可能没有帮助。 It all depends on how it's done. 这一切都取决于它是如何完成的。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM