[英]Haskell Optimizations for List Processing stymied by Lazy Evaluation
I'm trying to improve the efficiency of the following code. 我正在努力提高以下代码的效率。 I want to count all occurrences of a symbol before a given point (as part of pattern-matching using a Burrows-Wheeler transform). 我想在给定点之前计算符号的所有出现次数(作为使用Burrows-Wheeler变换的模式匹配的一部分)。 There's some overlap in how I'm counting symbols. 我如何计算符号有一些重叠。 However, when I have tried to implement what looks like it should be more efficient code, it turns out to be less efficient, and I'm assuming that lazy evaluation and my poor understanding of it is to blame. 然而,当我试图实现看起来应该是更有效的代码时,结果效率会降低,而且我假设懒惰的评估和我对它的不良理解是责任。
My first attempt at a counting function went like this: 我对计数功能的第一次尝试是这样的:
count :: Ord a => [a] -> a -> Int -> Int
count list sym pos = length . filter (== sym) . take pos $ list
Then in the body of the matching function itself: 然后在匹配函数本身的主体中:
matching str refCol pattern = match 0 (n - 1) (reverse pattern)
where n = length str
refFstOcc sym = length $ takeWhile (/= sym) refCol
match top bottom [] = bottom - top + 1
match top bottom (sym : syms) =
let topCt = count str sym top
bottomCt = count str sym (bottom + 1)
middleCt = bottomCt - topCt
refCt = refFstOcc sym
in if middleCt > 0
then match (refCt + topCt) (refCt + bottomCt - 1) syms
else 0
(Stripped down for brevity - I'm memoizing first occurrences of symbols in refCol through a Map, and a couple other details as well). (为简洁起见 - 我通过Map记录refCol中第一次出现的符号,以及其他一些细节)。
Edit : Sample use would be: 编辑 :示例使用将是:
matching "AT$TCTAGT" "$AACGTTTT" "TCG"
which should be 1 (assuming I didn't mistype anything). 应该是1(假设我没有输错任何东西)。
Now, I'm recounting everything in the middle between the top
pointer and the bottom
twice, which adds up when I count a million character DNA string with only 4 possible choices for characters (and profiling tells me that this is the big bottleneck, too, taking 48% of my time for bottomCt and around 38% of my time for topCt). 现在,我在top
指针和bottom
之间的中间重述两次,当我计算一百万个字符的DNA字符串时,只有4个可能的字符选择(并且分析告诉我这也是一个很大的瓶颈)。 ,我花了48%的时间用于bottomCt,大约38%的时间用于topCt)。 For reference, when calculating this for a million character string and trying to match 50 patterns (each of which is between 1 and 1000 characters), the program takes about 8.5 to 9.5 seconds to run. 作为参考,当计算一百万个字符串并尝试匹配50个模式(每个模式在1到1000个字符之间)时,程序运行大约需要8.5到9.5秒。
However, if I try to implement the following function: 但是,如果我尝试实现以下功能:
countBetween :: Ord a => [a] -> a -> Int -> Int -> (Int, Int)
countBetween list sym top bottom =
let (topList, bottomList) = splitAt top list
midList = take (bottom - top) bottomList
getSyms = length . filter (== sym)
in (getSyms topList, getSyms midList)
(with changes made to the matching function to compensate), the program takes between 18 and 22 seconds to run. (对匹配功能进行更改以进行补偿),程序运行需要18到22秒。
I've also tried passing in a Map which can keep track of previous calls, but that also takes about 20 seconds to run and runs up the memory usage. 我也试过传入一个可以跟踪之前调用的Map,但是这也需要大约20秒才能运行并运行内存使用量。
Similarly, I've shorted length . filter (== sym)
同样,我缩短了length . filter (== sym)
length . filter (== sym)
to a fold
, but again - 20 seconds for foldr
, and 14-15 for foldl
. length . filter (== sym)
到fold
,但是再次 - 对于foldr
为20秒,对于foldl
为14-15。
So what would be a proper Haskell way to optimize this code through rewriting it? 那么通过重写来优化这段代码的Haskell方法是什么呢? (Specifically, I'm looking for something that doesn't involve precomputation - I may not be reusing strings very much - and which explains something of why this is happening). (具体来说,我正在寻找一些不涉及预计算的东西 - 我可能不会重复使用字符串 - 这解释了为什么会发生这种情况)。
Edit : More clearly, what I am looking for is the following: 编辑 :更清楚的是,我正在寻找的是以下内容:
a) Why does this behaviour happen in Haskell? a)为什么在Haskell中会发生这种行为? How does lazy evaluation play a role, what optimizations is the compiler making to rewrite the count
and countBetween
functions, and what other factors may be involved? 惰性求值如何发挥作用,编译器对重写count
和countBetween
函数进行了哪些优化,以及可能涉及的其他因素是什么?
b) What is a simple code rewrite which would address this issue so that I don't traverse the lists multiple times? b)什么是简单的代码重写,它将解决这个问题,以便我不会多次遍历列表? I'm looking specifically for something which addresses that issue, rather than a solution which sidesteps it. 我正在寻找能够解决这个问题的东西,而不是一个可以避开它的解决方案。 If the final answer is, count
is the most efficient possible way to write the code, why is that? 如果最终答案是, count
是编写代码的最有效方法,为什么呢?
It seems that the main point of the match
routine is to transform a interval (bottom,top)
to another interval based on the current symbol sym
. 似乎match
例程的要点是基于当前符号sym
将间隔(bottom,top)
变换为另一个间隔。 The formulas are basically: 公式基本上是:
ref_fst = index of sym in ref_col
-- defined in an outer scope
match :: Char -> (Int,Int) -> (Int,Int)
match sym (bottom, top) | bottom > top = (bottom, top) -- if the empty interval
match sym (bottom, top) =
let
top_count = count of sym in str from index 0 to top
bot_count = count of sym in str from index 0 to bottom
mid_count = top_count - bot_count
in if mid_count > 0
then (ref_fst + bot_count, ref_fst + top_count)
else (1,0) -- the empty interval
And then matching
is just a fold over pattern
using match
with the initial interval (0, n-1)
. 然后matching
是刚刚超过一个折叠pattern
使用match
与初始间隔(0, n-1)
Both top_count
and bot_count
can be computed efficiently using a precomputed lookup table, and below is code which does that. top_count
和bot_count
都可以使用预先计算的查找表进行有效计算,下面是执行此操作的代码。
If you run test1
you'll see a trace of how the interval is transformed via each symbol in the pattern. 如果运行test1
您将看到如何通过模式中的每个符号转换间隔。
Note: There may be off-by-1 errors, and I've hard coded ref_fst
to be 0 - I'm not sure how this fits into the larger algorithm, but the basic idea should be sound. 注意:可能有1个错误的错误,并且我将ref_fst
硬编码为0 - 我不确定这是如何适应更大的算法,但基本的想法应该是合理的。
Note that once the counts
vector has been created there is no need to index into the original string anymore. 请注意,一旦创建了counts
向量,就不再需要索引原始字符串了。 Therefore, even though I use a ByteString here for the (larger) DNA sequence, it's not crucial, and the mkCounts
routine should work just as well if passed a String instead. 因此,即使我在这里使用ByteString(较大的)DNA序列,它也不是至关重要的,如果传递一个String, mkCounts
例程应该也能正常工作。
Code also available at http://lpaste.net/174288 代码也可在http://lpaste.net/174288获得
{-# LANGUAGE OverloadedStrings #-}
import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UVM
import qualified Data.ByteString.Char8 as BS
import Debug.Trace
import Text.Printf
import Data.List
mkCounts :: BS.ByteString -> UV.Vector (Int,Int,Int,Int)
mkCounts syms = UV.create $ do
let n = BS.length syms
v <- UVM.new (n+1)
let loop x i | i >= n = return x
loop x i = let s = BS.index syms i
(a,t,c,g) = x
x' = case s of
'A' -> (a+1,t,c,g)
'T' -> (a,t+1,c,g)
'C' -> (a,t,c+1,g)
'G' -> (a,t,c,g+1)
_ -> x
in do UVM.write v i x
loop x' (i+1)
x <- loop (0,0,0,0) 0
UVM.write v n x
return v
data DNA = A | C | T | G
deriving (Show)
getter :: DNA -> (Int,Int,Int,Int) -> Int
getter A (a,_,_,_) = a
getter T (_,t,_,_) = t
getter C (_,_,c,_) = c
getter G (_,_,_,g) = g
-- narrow a window
narrow :: Int -> UV.Vector (Int,Int,Int,Int) -> DNA -> (Int,Int) -> (Int,Int)
narrow refcol counts sym (lo,hi) | trace msg False = undefined
where msg = printf "-- lo: %d hi: %d refcol: %d sym: %s top_cnt: %d bot_count: %d" lo hi refcol (show sym) top_count bot_count
top_count = getter sym (counts ! (hi+1))
bot_count = getter sym (counts ! lo)
narrow refcol counts sym (lo,hi) =
let top_count = getter sym (counts ! (hi+1))
bot_count = getter sym (counts ! (lo+0))
mid_count = top_count - bot_count
in if mid_count > 0
then ( refcol + bot_count, refcol + top_count-1 )
else (lo+1,lo) -- signal an wmpty window
findFirst :: DNA -> UV.Vector (Int,Int,Int,Int) -> Int
findFirst sym v =
let n = UV.length v
loop i | i >= n = n
loop i = if getter sym (v ! i) > 0
then i
else loop (i+1)
in loop 0
toDNA :: String -> [DNA]
toDNA str = map charToDNA str
charToDNA :: Char -> DNA
charToDNA = go
where go 'A' = A
go 'C' = C
go 'T' = T
go 'G' = G
dnaToChar A = 'A'
dnaToChar C = 'C'
dnaToChar T = 'T'
dnaToChar G = 'G'
first :: DNA -> BS.ByteString -> Int
first sym str = maybe len id (BS.elemIndex (dnaToChar sym) str)
where len = BS.length str
test2 = do
-- matching "AT$TCTAGT" "$AACGTTTT" "TCG"
let str = "AT$TCTAGT"
refcol = "$AACGTTTT"
syms = toDNA "TCG"
-- hard coded for now
-- may be computeed an memoized
refcol_G = 4
refcol_C = 3
refcol_T = 5
counts = mkCounts str
w0 = (0, BS.length str -1)
w1 = narrow refcol_G counts G w0
w2 = narrow refcol_C counts C w1
w3 = narrow refcol_T counts T w2
firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
putStrLn $ "firsts: " ++ show firsts
putStrLn $ "w0: " ++ show w0
putStrLn $ "w1: " ++ show w1
putStrLn $ "w2: " ++ show w2
putStrLn $ "w3: " ++ show w3
let (lo,hi) = w3
len = if lo <= hi then hi - lo + 1 else 0
putStrLn $ "length: " ++ show len
matching :: BS.ByteString -> BS.ByteString -> String -> Int
matching str refcol pattern =
let counts = mkCounts str
n = BS.length str
syms = toDNA (reverse pattern)
firsts = (first A refcol, first T refcol, first C refcol, first G refcol)
go (lo,hi) sym = narrow refcol counts sym (lo,hi)
where refcol = getter sym firsts
(lo, hi) = foldl' go (0,n-1) syms
len = if lo <= hi then hi - lo + 1 else 0
in len
test3 = matching "AT$TCTAGT" "$AACGTTTT" "TCG"
I'm not sure lazy evaluation has much to do with the performance of the code. 我不确定懒惰的评估与代码的性能有多大关系。 I think the main problem is the use of String - which is a linked list - instead of more performant string type. 我认为主要问题是使用String - 这是一个链表 - 而不是更高性能的字符串类型。
Note that this call in your countBetween
function: 请注意,您的countBetween
函数中的此调用:
let (topList, bottomList) = splitAt top list
will re-create the linked link corresponding to topList
meaning a lot more allocations. 将重新创建与topList
相对应的链接链接, topList
意味着更多的分配。
A Criterion benchmark to compare splitAt
versus using take n/drop n
may be found here: http://lpaste.net/174526 . 可以在此处找到用于比较splitAt
与使用take n/drop n
的Criterion基准: http : //lpaste.net/174526 。 The splitAt
version is about 3 times slower and, of course, has a lot more allocations. splitAt
版本慢了大约3倍,当然,还有更多的分配。
Even if you don't want to "pre-compute" the counts you can improve matters a great deal by simply switching to either ByteString or Text. 即使您不想“预先计算”计数,您也可以通过简单地切换到ByteString或Text来改善问题。
Define: 限定:
countSyms :: Char -> ByteString -> Int -> Int -> Int
countSyms sym str lo hi =
length [ i | i <- [lo..hi], BS.index str i == sym ]
and then: 然后:
countBetween :: ByteString -> Char -> Int -> Int -> (Int,Int)
countBetween str sym top bottom = (a,b)
where a = countSyms sym str 0 (top-1)
b = countSyms sym str top (bottom-1)
Also, don't use reverse
on large lists - it will reallocate the entire list. 此外,不要在大型列表上使用reverse
- 它将重新分配整个列表。 Just index into a ByteString / Text in reverse. 只需反向索引ByteString / Text即可。
Memoizing counts may or may not help. 记忆计数可能有所帮助,也可能没有帮助。 It all depends on how it's done. 这一切都取决于它是如何完成的。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.