简体   繁体   中英

GHC's RTS options for garbage collection

I have a Haskell program which processes a text file and builds a Map (with several million elements). The whole thing can run for 2-3 minutes. I found that tweaking the -H and -A options makes a big difference in running time.

There is documentation about this functionality of the RTS, but it's a hard read for me since I don't know the algorithms and terms from GC theory. I'm looking for a less technical explanation, preferably specific to Haskell/GHC. Are there any references about choosing sensible values for these options?

EDIT: That's the code, it builds a trie for a given list of words.

buildTrie :: [B.ByteString] -> MyDFA 
buildTrie l = fst3 $ foldl' step (emptyDFA, B.empty, 1) $ sort $ map B.reverse l where
    step :: (MyDFA , B.ByteString, Int) -> B.ByteString -> (MyDFA , B.ByteString, Int)
    step (dfa, lastWord, newIndex) newWord = (insertNewStates, newWord, newIndex + B.length newSuffix) where            
        (pref, lastSuffix, newSuffix) = splitPrefix lastWord newWord
        branchPoint = transStar dfa pref

        --new state labels for the newSuffix path
        newStates = [newIndex .. newIndex + B.length newSuffix - 1]
        --insert newStates
        insertNewStates = (foldl' (flip insertTransition) dfa $ zip3 (branchPoint:init newStates) (B.unpack newSuffix) newStates)

Generally speaking, garbage collection is a space/time tradeoff. Give the GC more space, and it will take less time. There are (many) other factors in play, cache in particular, but the space/time tradeoff is the most important one.

The tradeoff works like this: the program allocates memory until it reaches some limit (decided by the GC's automatic tuning parameters, or explicitly via RTS options). When the limit is reached, the GC traces all the data that the program is currently using, and reclaims all the memory used by data that is no longer required. The longer you can delay this process, the more data will have become unreachable ("dead") in the meantime, so the GC avoids tracing that data. The only way to delay a GC is to make more memory available to allocate into; hence more memory equals fewer GCs, equals lower GC overhead. Roughly speaking, GHC's -H option lets you set a lower bound on the amount of memory used by the GC, so can lower the GC overhead.

GHC uses a generational GC , which is an optimisation to the basic scheme, in which the heap is divided into two or more generations. Objects are allocated into the "young" generation, and the ones that live long enough get promoted into the "old" generation (in a 2-generation setup). The young generation is collected much more frequently than the old generation, the idea being that "most objects die young", so young-generation collections are cheap (they don't trace much data), but they reclaim a lot of memory. Roughly speaking, the -A option sets the size of the young generation - that is, the amount of memory that will be allocated before the young generation is collected.

The default for -A is 512k: it's a good idea to keep the young generation smaller than the L2 cache, and performance generally drops if you exceed the L2 cache size. But working in the opposite direction is the GC space/time tradeoff: using a very large young generation size might outweigh the cache benefits by reducing the amount of work the GC has to do. This doesn't always happen, it depends on the dynamics of the application, which makes it hard for the GC to automatically tune itself. The -H option also increases the young generation size, so can also have an adverse effect on cache usage.

The bottom line is: play around with the options and see what works. If you have plenty of memory to spare, you might well be able to get a performance increase by using either -A or -H, but not necessarily.

It is probably possible to reproduce the issue for smaller data sets, where it will be easier to see what is going on. In particular, I suggest to gain some familiarity with profiling:

Then, check whether the memory profiles you see match your expectations (you don't need to know about all the profiling options to get useful graphs). The combination of strict foldl' with a non-strict tuple as the accumulator would be the first thing I'd look at: if the components of the tuple are not forced, that recursion is building up unevaluated expressions.

Btw, you can build up a good intuition about such things by trying to evaluate your code by hand for really tiny data sets. A few iterations will suffice to see whether expressions get evaluated or remain unevaluated in the way that fits your application.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM