Efficient parallel strategies

I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try using them with more than 1 core, the program slows down considerably.

For example, a while back I tried to calculate histograms (and from them unique words) from ~700 documents. I thought that file-level granularity would be OK. With -N4 I get a work balance of 1.70, yet with -N1 it runs in half the time that it does with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize and to gain some understanding of it. How would this be parallelized so that the speed increases with the number of cores instead of decreasing?

{-# LANGUAGE OverloadedStrings #-} -- needed for the Text string literals in the stop-word lists
import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')

mapReduce stratm m stratr r xs = let
  mapped = parMap stratm m xs
  reduced = r mapped `using` stratr
  in mapped `pseq` reduced

type Histogram = Map Text Int

rootDir = "/home/masse/Documents/text_conversion/"

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)

textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "." = True
        meta ".." = True
        meta _ = False

histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words

wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where
    reduce = M.unions

main = do
  list <- wordList
  print $ M.size list

As for the text files, I'm using PDFs converted to text files, so I can't provide them, but for this purpose almost any books from Project Gutenberg should do.

Edit: Added imports to the script.

I think Daniel got it right: Data.Map and lists are lazy data structures; you should use both foldl' and insertWith' to make sure the work for each chunk is done eagerly, otherwise all of the work is delayed until the sequential part (reduce).
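
For concreteness, here is a minimal sketch of what that strict version of histogram could look like, keeping the names and imports of the script above (foldl' is already imported from Data.List; with a newer containers, Data.Map.Strict.insertWith plays the role of insertWith'):

histogram :: Text -> Histogram
histogram = foldl' step M.empty . filter (not . isStopWord) . T.words
  where
    -- foldl' forces the accumulator map at every step and insertWith'
    -- forces the updated count, so each file's histogram is built
    -- eagerly inside its own spark instead of being deferred to reduce.
    step acc w = M.insertWith' (+) w 1 acc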

Also, it is not obvious that creating one spark per file is the right granularity, particularly if file sizes differ substantially. If that is the case, it would be preferable to concatenate the word lists and split them into even-sized chunks (see the parListChunk combinator).
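
A rough sketch of that rebalancing, assuming a new helper named histogramChunked, a hand-rolled chunksOf, and parList in place of parListChunk itself (the chunk size is an arbitrary knob to tune):

-- Pool the words of all files and spark one histogram per fixed-size
-- chunk, so unevenly sized files no longer produce unevenly sized sparks.
histogramChunked :: Int -> [Text] -> Histogram
histogramChunked chunkSize texts = M.unionsWith (+) perChunk
  where
    ws       = concatMap (filter (not . isStopWord) . T.words) texts
    perChunk = map (foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty)
                   (chunksOf chunkSize ws)
               `using` parList rseq

    -- split a list into chunks of at most n elements
    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

Since Data.Map is spine-strict and insertWith' forces the counts, evaluating each chunk's map to WHNF with rseq is enough to do the full work inside its spark.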

While you're at it, I would also look into the pitfalls of using lazy IO (readFile) to open many files (the runtime system might run out of file handles because it holds onto them for too long).

In practice, getting the parallel combinators to scale well can be difficult. Others have mentioned making your code more strict to ensure you are actually doing the work in parallel, which is definitely important.

Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversals put more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type or, even better, the HashSet type from the unordered-containers package (since repeated string comparisons can be expensive, especially if they share common prefixes).

import           Data.HashSet                (HashSet)
import qualified Data.HashSet                as S

...

finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)

isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord

Replacing your isStopWord function with this version performs and scales much better (though definitely not 1:1). You could also consider using HashMap (from the same package) rather than Map for the same reasons, but I did not get a noticeable change from doing so.
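
For reference, a minimal sketch of what that HashMap swap might look like, assuming Data.HashMap.Strict from the same unordered-containers package and a strict left fold (the names simply mirror the original script; the rest stays the same):

import           Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

type Histogram = HashMap Text Int

histogram :: Text -> Histogram
histogram = foldl' step HM.empty . filter (not . isStopWord) . T.words
  where
    -- insertWith from the Strict module evaluates the inserted count,
    -- so the per-file maps are still built eagerly.
    step acc w = HM.insertWith (+) w 1 acc

-- reduce becomes HM.unions (or HM.unionsWith (+) to sum the counts)
-- and main prints HM.size instead of M.size.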

Another option is to increase the default heap size to take some of the pressure off the GC and to give it more room to move things around. Giving the compiled code a default heap size of 1GB (the -H1G RTS flag), I get a GC balance of about 50% on 4 cores, whereas I only get ~25% without it (it also runs ~30% faster).

With these two alterations, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably, there is room for improvement based on the GC statistics (it still spends only 58% of the time doing productive work), but doing significantly better might require a much more drastic change to your algorithm.
