Hi, there,
Part of my program to compute differences between files makes use of the standard DP algorithm to compute the longest common noncontiguous subsequence between two lists. I've been running into performance issues with some of this functionality, so I ran HPC to profile, and found the following result:
                                                   individual      inherited
    COST CENTRE                      no. entries  %time %alloc   %time %alloc
    (omitted lines above)
    longestCommonSubsequence                   1    0.0    0.0    99.9  100.0
    longestCommonSubsequence'            8855742   94.5   98.4    99.9  100.0
    longestCommonSubsequence''           8855742    4.2    0.8     5.4    1.6
    longestCommonSubsequence''.caseY     3707851    0.6    0.6     0.6    0.6
    longestCommonSubsequence''.caseX     3707851    0.6    0.2     0.6    0.2
    (omitted lines below)
Here's the offending code:
    {-# LANGUAGE ScopedTypeVariables #-}
    import qualified Data.MemoCombinators as Memo

    -- Memoizing wrapper around the recursive worker.
    longestCommonSubsequence' :: forall a. (Eq a) => [a] -> [a] -> Int -> Int -> [a]
    longestCommonSubsequence' xs ys i j =
      (Memo.memo2 Memo.integral Memo.integral (longestCommonSubsequence'' xs ys)) i j

    -- Recursive worker; i and j track the positions reached in each list.
    longestCommonSubsequence'' :: forall a. (Eq a) => [a] -> [a] -> Int -> Int -> [a]
    longestCommonSubsequence'' [] _ _ _ = []
    longestCommonSubsequence'' _ [] _ _ = []
    longestCommonSubsequence'' (x:xs) (y:ys) i j =
      if x == y
        then x : (longestCommonSubsequence' xs ys (i + 1) (j + 1)) -- WLOG
        else if (length caseX) > (length caseY)
          then caseX
          else caseY
      where
        caseX :: [a]
        caseX = longestCommonSubsequence' xs (y:ys) (i + 1) j
        caseY :: [a]
        caseY = longestCommonSubsequence' (x:xs) ys i (j + 1)
I find it notable that all the time and memory usage is happening in longestCommonSubsequence', the memoizing wrapper. Hence, I would conclude that the performance hit is coming from all the lookups and caching done by Data.MemoCombinators, despite how it's always performed admirably the many other times I've used it.
I guess my question is... this conclusion seems reasonable; is it? If so, then does anyone have any recommendations for other ways to achieve the DP?
For reference, it takes 12 seconds - which is absurdly long - to compare two 14-line-long files with respective contents "a\nb\nc\n...m" and "*a\nb\nc\n...m*" (same contents but with '*' prepended and appended).
Thanks in advance! :)
EDIT: trying ghc-core stuff now; will post an update if I can get it to play nicely with a Cabal project and get any useful information!
When you call Memo.memo2 Memo.integral Memo.integral (longestCommonSubsequence'' xs ys), it creates a memoizer for the function longestCommonSubsequence'' xs ys. This means that there is one memoizer for each different value of xs and ys. I guess that most of the execution time is spent creating all those data structures for all those memoizers.
Did you mean to memoize on the 4 arguments of longestCommonSubsequence''?
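One way to get a single shared table is to build it once per top-level call and key it only on the positions (i, j). The sketch below is not the asker's code and does not use Data.MemoCombinators: it swaps in a lazily evaluated Data.Array from the array package, and the name lcs is made up for illustration. Each cell stores the length alongside the subsequence so that comparing two candidates does not require walking the lists.

```haskell
import Data.Array (array, listArray, range, (!))

-- Hypothetical helper (not from the original post): longest common
-- noncontiguous subsequence using one lazily filled table per call.
lcs :: Eq a => [a] -> [a] -> [a]
lcs xs ys = snd (table ! (0, 0))
  where
    n = length xs
    m = length ys

    -- Arrays give O(1) access to the element at a given position.
    xa = listArray (0, n - 1) xs
    ya = listArray (0, m - 1) ys

    -- table ! (i, j) is (length, subsequence) for the suffixes starting
    -- at positions i and j. Cells are thunks, so each one is evaluated
    -- at most once, and only when something demands it.
    bnds = ((0, 0), (n, m))
    table = array bnds [ (ij, cell ij) | ij <- range bnds ]

    cell (i, j)
      | i == n || j == m = (0, [])
      | xa ! i == ya ! j =
          let (len, rest) = table ! (i + 1, j + 1)
          in (len + 1, xa ! i : rest)
      | otherwise =
          let dropX = table ! (i + 1, j)   -- skip the element of xs
              dropY = table ! (i, j + 1)   -- skip the element of ys
          in if fst dropX > fst dropY then dropX else dropY
```

With this shape the number of evaluated cells is bounded by (n + 1) * (m + 1), so two inputs of about 15 lines each need at most a few hundred cells. The same idea should carry over to Memo.memo2, provided the memoized function is created once (for example in a where clause of the top-level function) and the recursion goes through it using only i and j.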