[英]Haskell avoiding stack overflow in folds without sacrificing performance
以下代码段在大型输入时遇到堆栈溢出:
{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
genTweets :: L.ByteString -> L.ByteString
genTweets text | L.null text = ""
| otherwise = L.intercalate "\n\n" $ genTweets' $ L.words text
where genTweets' txt = foldr p [] txt
where p word [] = [word]
p word words@(w:ws) | L.length word + L.length w <= 139 =
(word `L.append` " " `L.append` w):ws
| otherwise = word:words
我以为我的谓词正在建立一个列表,但是我不确定为什么或如何解决。
使用foldl'
的等效代码运行良好,但是要花很多时间,因为它会不断追加,并且会占用大量内存。
import Data.List (foldl')
genTweetsStrict :: L.ByteString -> L.ByteString
genTweetsStrict text | L.null text = ""
| otherwise = L.intercalate "\n\n" $ genTweetsStrict' $ L.words text
where genTweetsStrict' txt = foldl' p [] txt
where p [] word = [word]
p words word | L.length word + L.length (last words) <= 139 =
init words ++ [last words `L.append` " " `L.append` word]
| otherwise = words ++ [word]
是什么导致第一个代码片段积累了重音,可以避免吗? 是否可以编写第二个代码段,使其不依赖(++)
?
L.length word + L.length (last words) <= 139
这就是问题。 在每次迭代中,您都要遍历累加器列表,然后
init words ++ [last words `L.append` " " `L.append` word]
在末尾追加。 显然,这将花费很长时间(与累加器列表的长度成比例)。 更好的解决方案是延迟生成输出列表,在读取输入流的过程中进行交错处理(您无需读取整个输入即可输出前140个字符的tweet)。
以下程序版本使用O(1)空间在1秒内处理了一个相对较大的文件( /usr/share/dict/words
):
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)
genTweets :: L.ByteString -> L.ByteString
genTweets text | L.null text = ""
| otherwise = L.intercalate "\n\n" $ toTweets $ L.words text
where
-- Concatenate words into 139-character tweets.
toTweets :: [L.ByteString] -> [L.ByteString]
toTweets [] = []
toTweets [w] = [w]
toTweets (w:ws) = go (L.length w, w) ws
-- Main loop. Notice how the output tweet (cur_str) is generated as soon as
-- possible, thus enabling L.writeFile to consume it before the whole
-- input is processed.
go :: (Int64, L.ByteString) -> [L.ByteString] -> [L.ByteString]
go (_cur_len, !cur_str) [] = [cur_str]
go (!cur_len, !cur_str) (w:ws)
| lw + cur_len <= 139 = go (cur_len + lw + 1,
cur_str `L.append` " " `L.append` w) ws
| otherwise = cur_str : go (lw, w) ws
where
lw = L.length w
-- Notice the use of lazy I/O.
main :: IO ()
main = do dict <- L.readFile "/usr/share/dict/words"
L.writeFile "tweets" (genTweets dict)
p word words@(w:ws)
这种模式匹配导致对“ tail”的评估,这当然是文件夹p [](w:ws)的结果,这是pw ws的结果,这导致ws再次对头部进行模式匹配,依此类推。 。
请注意,foldr和foldl'将以不同方式拆分文本。 foldr将使最短的tweet首先出现,foldl'将使最短的tweet最后出现。
我会这样处理:
genTweets' = unfoldr f where
f [] = Nothing
f (w:ws) = Just $ g w ws $ L.length w
g w [] _ = (w, [])
g w ws@(w':_) len | len+1+(L.length w') > 139 = (w,ws)
g w (w':ws') len = g (w `L.append` " " `L.append` w') ws' $ len+1+(L.length w')
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.