Eq testing for large DAG structures in Haskell

I'm new to Haskell (a couple of months). I have a Haskell program that assembles a large expression DAG (not a tree, a DAG), potentially deep and with multiple merging paths (i.e., the number of different paths from root to leaves is huge). I need a fast way to test these DAGs for equality. The default Eq derivation will just recurse, exploring the same nodes multiple times. Currently this causes my program to take 60 seconds for relatively small expressions, and not even finish for larger ones. The profiler indicates it is busy checking equality most of the time. I would like to implement a custom Eq that does not have this problem, but I don't see a way to do it that does not involve a lot of rewriting, so I want to hear your thoughts.

My first attempt was to 'instrument' tree nodes with a hash that I compute incrementally, using Data.Hashable.hash , as I build the tree. This approach gave me an easy way to test two things aren't equal without looking deep into the structure. But often in this DAG, because of the paths in the DAG merging, the structures are indeed equal. So the hashes are equal, and I revert to full blown equality testing.

If I had a way to do physical equality, then a lot of my problems here would go away: if they are physically equal, then that's it. Otherwise if the hash is different then that's it. Only go deeper if they are physically not the same, but their hash agrees.
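For illustration, the three-step check described above might look like the sketch below. It assumes GHC and uses reallyUnsafePtrEquality# from GHC.Exts, which can report False for values that are in fact the same, so it is only used as a shortcut for "definitely equal". The Expr type, the smart constructors var and add, and the toy mix function (standing in for Data.Hashable) are hypothetical names chosen for this example.

```haskell
{-# LANGUAGE MagicHash #-}
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- Hypothetical expression type; the Int in each node caches a hash of its subtree.
data Expr
  = Var !Int String
  | Add !Int Expr Expr

-- FNV-style mixing step (illustrative only, not collision resistant).
mix :: Int -> Int -> Int
mix h x = (h * 16777619) `xor` x

-- Read the cached hash in O(1).
nodeHash :: Expr -> Int
nodeHash (Var h _)   = h
nodeHash (Add h _ _) = h

-- Smart constructors: a node's hash is built from its children's cached hashes.
var :: String -> Expr
var s = Var (foldl' (\h c -> mix h (ord c)) 1 s) s

add :: Expr -> Expr -> Expr
add l r = Add (mix (mix 2 (nodeHash l)) (nodeHash r)) l r

-- True only when both arguments are the same heap object.
-- A False result says nothing, so it must never be used to conclude inequality.
samePtr :: a -> a -> Bool
samePtr x y = x `seq` y `seq` isTrue# (reallyUnsafePtrEquality# x y)

instance Eq Expr where
  a == b
    | samePtr a b              = True    -- physically the same node
    | nodeHash a /= nodeHash b = False   -- different hashes: cannot be equal
    | otherwise                = deep a b
    where
      -- Hashes agree but the nodes are distinct objects: compare one level,
      -- letting the recursive (==) apply the same shortcuts to the children.
      deep (Var _ x)     (Var _ y)     = x == y
      deep (Add _ l1 r1) (Add _ l2 r2) = l1 == l2 && r1 == r2
      deep _             _             = False
```

With this instance, comparing a heavily shared DAG against itself returns at the first guard. Two separately built but structurally equal DAGs still fall through to the deep comparison, so this sketch does not remove the worst case on its own.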

I could also imitate git and compute a SHA1 per node, deciding equality on the hash alone (no need to recurse). I know for a fact that this would help, because if I let equality be decided fully in terms of hash equality, the program runs in tens of milliseconds for the largest expressions. This approach also has the nice advantage that if for some reason two DAGs are not physically equal but are content-equal, I would be able to detect that quickly as well. (With ids, I'd still have to do a traversal at that point.) So I like the semantics more.

This approach, however, involves a lot more work than just calling the Data.Hashable.hash function, because I have to derive it for every variant of the DAG node type. Moreover, I have multiple DAG representations with slightly different node definitions, so I would need to do this hashing trick twice, or more if I decide to add more representations.

What would you do?

Part of the problem here is that Haskell has no concept of object identity, so when you say you have a DAG where you refer to the same node twice, as far as Haskell is concerned it's just two values in different places in a tree. This is fundamentally different from the OO concept where an object is identified by its location in memory, so the distinction between "same object" and "different objects with equal fields" is meaningful.

To solve your problem you need to detect when you are visiting the same object that you saw earlier, and in order to do that you need to have a concept of "same object" that is independent of the value. There are two basic ways to attack this:

  • Store all your objects in a vector (i.e. an array), and use the vector index as an object identity. Replace values with indices throughout your data structure.

  • Give each object a unique "identity" field so you can tell if you have seen this one before when traversing the DAG.

The former is how the Data.Graph module in the containers package does it. One advantage is that, if you have a single mapping from DAG to vector, then DAG equality becomes just vector equality.
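The second option (an identity field plus a record of what has already been seen) can be sketched as follows. This is only an illustration under stated assumptions: the Node type, its field names, and eqDag are hypothetical, and nodeId is assumed to be unique across both DAGs being compared, so that equal ids mean the same node.

```haskell
import Control.Monad (foldM)
import qualified Data.Set as S

-- Hypothetical node type; 'nodeId' is assumed unique per physical node.
data Node = Node
  { nodeId   :: Int
  , label    :: String
  , children :: [Node]
  }

-- Structural equality that visits each pair of nodes at most once.
-- The set holds pairs of ids already proven equal; a single mismatch
-- aborts the whole comparison via Nothing.
eqDag :: Node -> Node -> Bool
eqDag a0 b0 = maybe False (const True) (go S.empty a0 b0)
  where
    go :: S.Set (Int, Int) -> Node -> Node -> Maybe (S.Set (Int, Int))
    go seen a b
      | nodeId a == nodeId b        = Just seen   -- same node (by assumption)
      | S.member key seen           = Just seen   -- pair already proven equal
      | label a /= label b          = Nothing
      | length (children a) /= length (children b) = Nothing
      | otherwise = do
          -- Thread the set through the children, left to right.
          seen' <- foldM (\s (x, y) -> go s x y) seen
                         (zip (children a) (children b))
          Just (S.insert key seen')
      where
        key = (nodeId a, nodeId b)
```

Because a pair is recorded once its subtrees have been compared, a node reached through many merging paths costs one set lookup per extra path rather than a full re-traversal.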

Any efficient way to test for equality will be intertwined with the way you build up the DAG values.

Here is an idea which keeps track of all nodes ever created in a Map. As new nodes are added to the Map they are assigned a unique id.

Creating nodes now becomes monadic, as you have to thread this Map (and the next available id) throughout your computation.

In this example the nodes are implemented as Rose trees, and the order of the children is not significant - hence the call to sort in deriving the key into the map.

 import Control.Monad.State
 import Data.List
 import qualified Data.Map as M

 data Node = Node { _eqIdent:: Int      -- equality identifier
                  , _value :: String    -- value associated with the node
                  , _children :: [Node] -- children
                  }
   deriving (Show)

 type BuildState = (Int, M.Map (String,[Int]) Node)

 buildNode :: String -> [Node] -> State BuildState Node
 buildNode value nodes = do
   (nextid, nodeMap) <- get
   let key = (value, sort (map _eqIdent nodes))  -- the identity of the node
   case M.lookup key nodeMap of
     Nothing   -> do let n = Node nextid value nodes
                         nodeMap' = M.insert key n nodeMap
                     put (nextid+1, nodeMap')
                     return n
     Just node -> return node

 nodeEquality :: Node -> Node -> Bool
 nodeEquality a b = _eqIdent a == _eqIdent b

One caveat -- this approach requires that you know all the children of a node when you build it.
