Understanding the performance of Haskell functions

Question

I was writing a haskell program where I needed to zip two lists with a function, but I didn't want it to stop at the end of the shorter list (like the standard version of zipWith ) but rather continue until the end of the longer list, using some default value after reaching the end of the shorter list. My first implementation looked like this:

zipWithAll :: (a -> b -> c) -> a -> b -> [a] -> [b] -> [c]
zipWithAll f x y = go
    where go []     []     = []
          go (a:as) []     = f a y : go as []
          go []     (b:bs) = f x b : go [] bs
          go (a:as) (b:bs) = f a b : go as bs

However, I usually prefer writing functions using the standard library higher order functions, rather than using explicit recursion, since it often increases performance and makes the code nicer to look at. So I tried this:

zipWithAll' :: (a -> b -> c) -> a -> b -> [a] -> [b] -> [c]
zipWithAll' f x y xs ys = zipWith f xs' ys'
    where n   = max (length xs) (length ys)
          xs' = take n $ xs ++ repeat x
          ys' = take n $ ys ++ repeat y

To me this looked much worse performance wise, since using length twice means two more list traversals. But surprisingly, when I compared the average time, the first version was about 20% slower than the second.

So, I figured I would go with the second one and add arguments for the length of the lists, since in some cases, they would be known beforehand. Thus, I wrote this:

zipWithAll'' :: (a -> b -> c) -> Int -> Int -> a -> b -> [a] -> [b] -> [c]
zipWithAll'' f n m x y xs ys = zipWith f xs' ys'
    where k   = max n m
          xs' = take k $ xs ++ repeat x
          ys' = take k $ ys ++ repeat y

But, even more surprisingly, the third version only improved performance by a tiny margin. For two randomly generated lists of Int , xs and ys , length xs = length ys = n = 1000000 , I got the following:

| function                       | average time, 30 evaluations |
+--------------------------------+------------------------------+
| zipWithAll   (+) 0 0 xs ys     |                        1.20s |
| zipWithAll'  (+) 0 0 xs ys     |                        0.95s |
| zipWithAll'' (+) n n 0 0 xs ys |                        0.94s |
+--------------------------------+------------------------------+

I get that this isn't the most comprehensive benchmarking test, but still, it seems to go against all my intuition for what makes haskell programs run faster. It makes me think that I'm missing something important for understanding the performance of haskell functions.

So basically what I would like to know is this:

Why is the simple recursive approach the slowest? Is it just due to optimizations done for the standard zipWith function? If so, is there something I could do to make it perform similarly to zipWith ?

Also, am I wrong in assuming that the second version does 3n operations while the third only does n ? If so, why doesn't this have a greater impact on performance? I could imagine this being less relevant if the zipping function was very time consuming but I'm only using (+) here.

Finally, is there a way to implement a faster zipWithAll , by taking advantage of the optimizations for the standard library functions, without knowing the lengths of the lists beforehand?

(Edit) This is the relevant parts of the benchmarking code that I used.

{-# LANGUAGE BangPatterns #-}

import Control.Monad (replicateM, forM_)
import Data.Foldable (foldl')
import Data.Time (diffUTCTime, getCurrentTime, NominalDiffTime)
import Numeric (showEFloat, showFFloat)
import Test.QuickCheck

main = do
    let n  = 1000000
        fs = [ ("zipWithAll", uncurry4 $ zipWithAll (+))
             , ("zipWithAll'", uncurry4 $ zipWithAll' (+))
             , ("zipWithAll''", uncurry4 $ zipWithAll'' (+) n n)]

    xs <- generate (vectorOf n arbitrary :: Gen [Int])
    ys <- generate (vectorOf n arbitrary :: Gen [Int])
    benchmark fs (0, 0, xs, ys) 30

uncurry4 :: (a -> b -> c -> d -> e) -> (a,b,c,d) -> e
uncurry4 f (a,b,c,d) = f a b c d

-- | Measure and print the average time it takes for each function in the list to return.
benchmark :: (Show a, Show b) => [(String, (a -> b))] -> a -> Int -> IO ()
benchmark fs x rep = do
    force x
    forM_ fs $ \(name, f) -> do
        ts <- replicateM rep (measureTime f x)
        putStrLn $ "function: " ++ name ++ ", time = " ++ (showSignificant 2 $ average ts)

-- | Get the time measurement for a function applied to an arguemnt
measureTime :: Show b => (a -> b) -> a -> IO NominalDiffTime
measureTime f x = do
    t1 <- getCurrentTime
    force (f x)
    t2 <- getCurrentTime
    return $ diffUTCTime t2 t1

-- | Force the computation of a value
force :: Show a => a -> IO ()
force a = maximum (show a) `seq` return ()


-- | Show a time difference using @n@ significant figures
showSignificant :: Int -> NominalDiffTime -> String
showSignificant n a = showFFloat Nothing b "s"
  where
    ae = showEFloat (Just (n-1)) (fromRational (toRational a)) ""
    b  = read ae :: Double

-- | Take the average of the elements in a foldable data structure
average :: (Foldable t, Fractional a) => t a -> a
average = uncurry (/) . foldl' f (0,0)
    where f (s,l) x = (s', l')
            where !s' = x + s
                  !l' = 1 + l

Answer 1

Criterion is the gold standard for benchmarking in Haskell. I don't really believe benchmarks that come from anywhere else, so I ported your suite to Criterion. I've included my source file at the bottom of this Answer, so that in case I have done something wrong, someone can fix it easily. One important difference: I made the lists xs and ys different sizes, to actually exercise the interesting part of your function: ys is twice as big as xs . Here are the results I see:

benchmarking standalone/zipWithAll
time                 18.53 ms   (18.08 ms .. 18.96 ms)
                     0.996 R²   (0.993 R² .. 0.998 R²)
mean                 19.77 ms   (19.28 ms .. 20.40 ms)
std dev              1.345 ms   (1.021 ms .. 1.711 ms)
variance introduced by outliers: 30% (moderately inflated)

benchmarking standalone/zipWithAll'
time                 43.61 ms   (43.25 ms .. 44.00 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 43.37 ms   (43.27 ms .. 43.57 ms)
std dev              256.9 μs   (140.4 μs .. 437.6 μs)

benchmarking standalone/zipWithAll''
time                 27.65 ms   (27.32 ms .. 28.20 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 27.58 ms   (27.40 ms .. 27.92 ms)
std dev              513.2 μs   (339.7 μs .. 770.0 μs)

Your simple recursive approach is twice as fast as the version that traverses the list twice - not so surprising, really, If you pass the sizes up front, you save some of that extra expense, but you're still doing concatenation and take , neither of which is free, so you come out noticeably behind.

Why is the simple version fastest? You mention that zipWith has an optimized implementation in the standard library, but if you look at it , you'll see that its implementation is exactly what you or I would have written. The one interesting thing is the note about fusion, which I think mostly means that, if you write map succ (zipWith f (filter even xs) ys) or something of the like, it can fuse the filter , zipWith and map into a single looping operation, without having to materialize intermediate lists. So, I lied when above I claimed my only interesting modification to your suite was to change the list sizes. I also added benchmarks for using the functions in this way, which we can see here:

benchmarking fused/zipWithAll
time                 43.80 ms   (43.43 ms .. 44.29 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 43.65 ms   (43.41 ms .. 43.94 ms)
std dev              522.9 μs   (374.5 μs .. 651.1 μs)

benchmarking fused/zipWithAll'
time                 132.3 ms   (128.3 ms .. 138.7 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 131.4 ms   (127.6 ms .. 133.7 ms)
std dev              4.495 ms   (2.430 ms .. 6.794 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking fused/zipWithAll''
time                 52.83 ms   (52.36 ms .. 53.36 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 52.51 ms   (52.34 ms .. 52.72 ms)
std dev              366.1 μs   (246.8 μs .. 486.0 μs)

Version 1 is still the winner, but version 2 has gotten even worse while version 3 has narrowed the gap. Is this evidence of anything in particular? Maybe it suggests that some of the excess operations do get fused when using zipWithAll'' . Probably not, though - I bet the f and f' lambdas make it too hard for GHC to inline all the way. But I'd have to look at the core to get any better guesses, and I don't have time to get into that right now. You can give it a try with -ddump-simpl if you want.

As promised, here's the code for my benchmark:

module Main (main) where

import Criterion.Main
import Test.QuickCheck

zipWithAll :: (a -> b -> c) -> a -> b -> [a] -> [b] -> [c]
zipWithAll f x y = go
    where go []     []     = []
          go (a:as) []     = f a y : go as []
          go []     (b:bs) = f x b : go [] bs
          go (a:as) (b:bs) = f a b : go as bs

zipWithAll' :: (a -> b -> c) -> a -> b -> [a] -> [b] -> [c]
zipWithAll' f x y xs ys = zipWith f xs' ys'
    where n   = max (length xs) (length ys)
          xs' = take n $ xs ++ repeat x
          ys' = take n $ ys ++ repeat y

zipWithAll'' :: (a -> b -> c) -> Int -> Int -> a -> b -> [a] -> [b] -> [c]
zipWithAll'' f n m x y xs ys = zipWith f xs' ys'
    where k   = max n m
          xs' = take k $ xs ++ repeat x
          ys' = take k $ ys ++ repeat y

main :: IO ()
main = do
  let xSize = 1000000
      ySize = xSize * 2
  xs <- generate (vectorOf xSize arbitrary :: Gen [Int])
  ys <- generate (vectorOf ySize arbitrary :: Gen [Int])
  let impls = [ ("zipWithAll", zipWithAll (+) 0 0)
              , ("zipWithAll'", zipWithAll' (+) 0 0)
              , ("zipWithAll''", zipWithAll'' (+) xSize ySize 0 0)
              ]
  defaultMain [ bgroup "standalone" $ do
                  (name, f) <- impls
                  let f' (xs, ys) = f xs ys
                  pure . bench name $ nf f' (xs, ys)
              , bgroup "fused" $ do
                  (name, f) <- impls
                  let f' (xs, ys) = map succ (f (filter even xs) ys)
                  pure . bench name $ nf f' (xs, ys)
              ]

Understanding the performance of Haskell functions

Question

1 answers

solution1
4 ACCPTED 2022-12-08 21:46:15

Understanding the performance of Haskell functions

Question

1 answers

solution1 4 ACCPTED 2022-12-08 21:46:15

solution1
4 ACCPTED 2022-12-08 21:46:15