Is there a way to avoid copying the whole search path of a binary tree on insert?

Question

I've just started working my way through Okasaki's Purely Functional Data Structures, but have been doing things in Haskell rather than Standard ML. However, I've come across an early exercise (2.5) that's left me a bit stumped on how to do things in Haskell:

Inserting an existing element into a binary search tree copies the entire search path even though the copied nodes are indistinguishable from the originals. Rewrite insert using exceptions to avoid this copying. Establish only one handler per insertion rather than one handler per iteration.

Now, my understanding is that ML, being an impure language, gets by with a conventional approach to exception handling not so different from, say, Java's, so you can accomplish it something like this:

datatype tree = E | T of tree * int * tree

exception ElementPresent

fun insert (x, t) =
  let fun go E = T (E, x, E)
        | go (T (l, y, r)) =
             if      x < y then T (go l, y, r)
             else if y < x then T (l, y, go r)
             else    raise ElementPresent
  in go t
  end
  handle ElementPresent => t

I don't have an ML implementation, so this may not be quite right in terms of the syntax.

My issue is that I have no idea how this can be done in Haskell, outside of doing everything in the IO monad, which seems like cheating. Even if it's not cheating, it would seriously limit the usefulness of a function which really doesn't do any mutation. I could use the Maybe monad:

data Tree a = Empty | Fork (Tree a) a (Tree a)
        deriving (Show)

insert     :: (Ord a) => a -> Tree a -> Tree a
insert x t = maybe t id (go t)
  where
    -- Just t' when x was inserted, Nothing when x is already present
    go Empty   = return (Fork Empty x Empty)
    go (Fork l y r)
      | x < y     = do l' <- go l; return (Fork l' y r)
      | x > y     = do r' <- go r; return (Fork l y r')
      | otherwise = Nothing

This means everything winds up wrapped in Just on the way back up when the element isn't found, which requires more heap allocation, and sort of defeats the purpose. Is this allocation just the price of purity?

EDIT to add: A lot of why I'm wondering about the suitability of the Maybe solution is that the optimization described only seems to save you the constructor calls you would need in the case where the element already exists, which means heap allocations proportional to the length of the search path. The Maybe version also avoids those constructor calls when the element already exists, but then you get a number of Just constructor calls equal to the length of the search path. I understand that a sufficiently smart compiler could elide all the Just allocations, but I don't know if, say, the current version of GHC is really that smart.

Answer 1

GHC generally cannot elide path copying in cases like that. However, there is a way to do it manually, without incurring any of the indirection/allocation costs of Maybe. Here it is:

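{-# LANGUAGE MagicHash #-}

import GHC.Exts (reallyUnsafePtrEquality#)

data Tree a = Empty | Fork (Tree a) a (Tree a)
        deriving (Show)

insert :: (Ord a) => a -> Tree a -> Tree a
insert x Empty = Fork Empty x Empty
insert x node@(Fork l y r)
    -- Force each new subtree with seq before comparing: the primop does not
    -- evaluate its arguments, and an unevaluated thunk never equals l or r.
    | x < y = let l' = insert x l in
        l' `seq` case reallyUnsafePtrEquality# l l' of
            1# -> node          -- left subtree unchanged: reuse this node
            _  -> Fork l' y r
    | x > y = let r' = insert x r in
        r' `seq` case reallyUnsafePtrEquality# r r' of
            1# -> node          -- right subtree unchanged: reuse this node
            _  -> Fork l y r'
    | otherwise = node          -- x already present: nothing to copy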

The pointer equality function does exactly what's in the name. Here it is safe because even if the equality returns a false negative we only do a bit of extra copying, and nothing worse happens.

It's not the most idiomatic or prettiest Haskell, but the performance benefits can be significant. In fact, this trick is used very frequently in unordered-containers.

Answer 2

In terms of cost, the ML version is actually very similar to your Haskell version.

Every recursive call in the ML version results in a stack frame. The same is true in the Haskell version. The stack usage is going to be proportional in size to the path that you traverse in the tree. Also, both versions will of course allocate new nodes for the entire path if an insertion is actually performed.

In your Haskell version, every recursive call might also eventually result in the allocation of a Just node. This will go on the minor heap, which is just a block of memory with a bump pointer. For all practical purposes, GHC's minor heap is roughly equivalent in cost to the stack. Since these are short-lived allocations, they won't normally end up being moved to the major heap at all.

Answer 3

As fizruk indicates, the Maybe approach is not significantly different from what you'd get in Standard ML. Yes, the whole path is copied, but the new copy is discarded if it turns out not to be needed. The Just constructor itself may not even be allocated on the heap: it can't escape from insert, let alone the module, and you don't do anything weird with it, so the compiler is free to analyze it to death.

Edit

There are efficiency problems, now that I think of it. Your use of Maybe conceals the fact that you're actually making two passes: one down to find the insertion point and one up to build the tree. The solution to this is to drop Maybe Tree in favor of (Tree,Bool) and use strictness annotations, or to switch to continuation-passing style. Also, if you choose to stay with the three-way logic, you may want to use the three-way comparison function, compare. Alternatively, you can go all the way to the bottom each time and check later if you hit a duplicate.
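To make the two suggestions above concrete, here is a rough sketch of both, using compare for the three-way test. The names insertCPS and insertFlag are made up for illustration and are not from the answer. Note that the continuation-passing version still allocates closures on the way down, so it is unlikely to be allocation-free; what it avoids is building Fork or Just nodes when the element is already present.

```haskell
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show, Eq)

-- | Continuation-passing insert (hypothetical name). The continuation k
-- rebuilds the path above the current node and is only invoked when x is
-- absent from the tree.
insertCPS :: Ord a => a -> Tree a -> Tree a
insertCPS x t = go t id
  where
    go Empty k = k (Fork Empty x Empty)
    go (Fork l y r) k = case compare x y of
      LT -> go l (\l' -> k (Fork l' y r))
      GT -> go r (\r' -> k (Fork l y r'))
      EQ -> t  -- duplicate: drop the continuation, return the original tree

-- | (Tree, Bool) insert (hypothetical name). The flag says whether anything
-- changed; matching on the pair with case forces it at every level.
insertFlag :: Ord a => a -> Tree a -> Tree a
insertFlag x t = fst (go t)
  where
    go Empty = (Fork Empty x Empty, True)
    go node@(Fork l y r) = case compare x y of
      LT -> case go l of
        (l', True) -> (Fork l' y r, True)
        (_, False) -> (node, False)  -- left subtree unchanged: reuse node
      GT -> case go r of
        (r', True) -> (Fork l y r', True)
        (_, False) -> (node, False)  -- right subtree unchanged: reuse node
      EQ -> (node, False)            -- x already present
```

The EQ branch of insertCPS plays the role of the single exception handler in the ML version: abandoning k discards all the pending rebuild work at once.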

Answer 4

If you have a predicate that checks whether the key is already in the tree, you can look before you leap:

insert x t  =  if contains t x then t else insert' x t

This traverses the tree twice, of course. Whether that's as bad as it sounds should be determined empirically: the first traversal might just load the relevant part of the tree into the cache.
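For completeness, a minimal self-contained sketch of this approach might look as follows. The answer does not define contains or insert', so the bodies below are assumptions: contains is an ordinary membership test (with the answer's argument order), and insert' is the plain path-copying insert.

```haskell
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show, Eq)

-- | Membership test (assumed definition); argument order follows the
-- answer's `contains t x`.
contains :: Ord a => Tree a -> a -> Bool
contains Empty _ = False
contains (Fork l y r) x = case compare x y of
  LT -> contains l x
  GT -> contains r x
  EQ -> True

-- | Plain path-copying insert (assumed definition), intended to be called
-- only when x is known to be absent.
insert' :: Ord a => a -> Tree a -> Tree a
insert' x Empty = Fork Empty x Empty
insert' x node@(Fork l y r) = case compare x y of
  LT -> Fork (insert' x l) y r
  GT -> Fork l y (insert' x r)
  EQ -> node

-- | Look before you leap: return the original tree when x is present.
insert :: Ord a => a -> Tree a -> Tree a
insert x t = if contains t x then t else insert' x t
```

When x is already present, insert returns t itself, so no Fork nodes are built; when it is absent, the search path is walked twice.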

4 answers

Answer 1 (5, 2014-05-22 16:25:24)

Answer 2 (5, accepted, 2014-05-22 17:29:24)

Answer 3 (2, 2014-05-22 14:44:02)

Answer 4 (0, 2014-05-22 14:53:33)
