Fast implementation of a binary tree in Haskell

Question

I implemented a binary tree data structure in Haskell.

My code:

module Data.BTree where

data Tree a = EmptyTree 
                | Node a (Tree a) (Tree a)
                deriving (Eq, Ord, Read, Show)

emptyTree :: a -> Tree a  
emptyTree a = Node a EmptyTree EmptyTree

treeInsert :: (Ord a) => a -> Tree a -> Tree a
treeInsert x EmptyTree = emptyTree x
treeInsert x  (Node a left right) 
        | x == a = (Node x left right)
        | x < a =  (Node a (treeInsert x left) right)   
        | x > a =  (Node a left (treeInsert x right))


fillTree :: Int -> Tree Int -> Tree Int
fillTree  10000 tree = tree 
fillTree  x tree = let a = treeInsert x tree
                   in fillTree (x + 1) a

This code is very slow. I run:

fillTree 1 EmptyTree

I get: 50.24 secs

I tried implementing the same code in C, and the result of that test was 0m0.438s.

Why such a big difference? Is Haskell code really that slow, or is my binary tree in Haskell bad? I'd like to ask the Haskell gurus: can I make my binary tree implementation more efficient?

Thank you.

Answer 1

First, another data point: the Set data structure in the Data.Set module happens to be a binary tree. I've translated your fillTree function to use it instead:

import qualified Data.Set as Set
import Data.Set (Set)

fillSet :: Int -> Set Int -> Set Int
fillSet 10000 set = set
fillSet x set = let a = Set.insert x set
                in fillSet (x + 1) a

Running fillSet 1 Set.empty in GHCi, including a bit of extra computation to be sure that the entire result is evaluated, finishes with no perceptible delay. So this seems to indicate that the problem lies in your implementation.

To start with, I suspect the biggest difference between Data.Set.Set and your implementation is that, if I'm reading your code correctly, you're not actually testing a binary tree. You're testing an over-complicated linked list (i.e., a maximally unbalanced tree) as a result of inserting elements in increasing order. Data.Set.Set uses a balanced binary tree, which handles this pathological input much better.
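To see the degenerate shape concretely, here is a minimal sketch that reuses the question's tree type and insertion logic. The helpers `depth`, `fromListInOrder`, and `middleFirst` are hypothetical names introduced only for this illustration; they are not part of the original code.

```haskell
module Main where

import Data.List (foldl')

-- Same shape as the tree in the question.
data Tree a = EmptyTree
            | Node a (Tree a) (Tree a)

-- Same insertion logic as the question, written with `compare`.
treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x EmptyTree = Node x EmptyTree EmptyTree
treeInsert x (Node a left right) =
  case compare x a of
    EQ -> Node x left right
    LT -> Node a (treeInsert x left) right
    GT -> Node a left (treeInsert x right)

-- Hypothetical helper: number of nodes on the longest root-to-leaf path.
depth :: Tree a -> Int
depth EmptyTree    = 0
depth (Node _ l r) = 1 + max (depth l) (depth r)

-- Hypothetical helper: insert the elements from left to right.
fromListInOrder :: Ord a => [a] -> Tree a
fromListInOrder = foldl' (flip treeInsert) EmptyTree

-- Hypothetical helper: reorder a sorted list so that the middle element
-- comes first, followed recursively by the middles of each half.
middleFirst :: [a] -> [a]
middleFirst [] = []
middleFirst xs = case splitAt (length xs `div` 2) xs of
  (l, m : r) -> m : middleFirst l ++ middleFirst r
  (_, [])    -> []   -- unreachable for non-empty input

main :: IO ()
main = do
  let xs = [1 .. 1000] :: [Int]
  print (depth (fromListInOrder xs))               -- prints 1000
  print (depth (fromListInOrder (middleFirst xs))) -- prints 10
```

With sorted input every new element goes to the right of all previous ones, so each insertion walks the whole spine. Feeding the same elements in a middle-first order keeps the depth logarithmic even without any rebalancing logic.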

We can also look at the definition of Set:

data Set a = Tip 
           | Bin {-# UNPACK #-} !Size a !(Set a) !(Set a)

Without going into too much detail, what this says is that Set tracks the size of the tree and avoids a few less-than-useful layers of indirection that would otherwise exist in the data type.
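To illustrate what those annotations do, here is a hedged sketch that applies the same ideas to an unbalanced tree like the one in the question. The names `STree`, `size`, `bin`, and `insert` are made up for this example and are not the Data.Set API; unlike Data.Set, this sketch does no rebalancing, so it only shows the strictness and size bookkeeping.

```haskell
module Main where

-- Strict, size-annotated tree in the style of the Set definition above.
-- The bangs make the fields strict, so a node never holds an unevaluated
-- subtree; UNPACK asks GHC to store the Int directly in the constructor.
data STree a = Tip
             | Bin {-# UNPACK #-} !Int a !(STree a) !(STree a)

-- The size is cached in every node, so this is O(1).
size :: STree a -> Int
size Tip           = 0
size (Bin n _ _ _) = n

-- Smart constructor that recomputes the cached size from the children.
bin :: a -> STree a -> STree a -> STree a
bin x l r = Bin (size l + size r + 1) x l r

-- Plain (unbalanced) insertion that keeps the size field up to date.
insert :: Ord a => a -> STree a -> STree a
insert x Tip = Bin 1 x Tip Tip
insert x (Bin n a l r) =
  case compare x a of
    EQ -> Bin n x l r
    LT -> bin a (insert x l) r
    GT -> bin a l (insert x r)

main :: IO ()
main = print (size (foldr insert Tip [3, 1, 2, 3 :: Int]))  -- prints 3
```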

The full source of the Data.Set module is available in the containers package; you may find it enlightening to study.

A few more observations, to demonstrate the difference between different ways of running it. I added the following to your code:

toList EmptyTree = []
toList (Node x l r) = toList l ++ [x] ++ toList r

main = print . sum . toList $ fillTree 1 EmptyTree

This traverses the tree, sums the elements, and prints the total, which should ensure that everything is forced. My system is probably somewhat unusual, so you may get rather different results trying this yourself, but the relative differences should be accurate enough. Some results:

Using runhaskell, which should be roughly equivalent to running it in GHCi:
```
 real 1m36.055s user 0m0.093s sys 0m0.062s
```
Building with ghc --make -O0:
```
 real 0m3.904s user 0m0.030s sys 0m0.031s
```
Building with ghc --make -O2:
```
 real 0m1.765s user 0m0.015s sys 0m0.030s
```

Using my equivalent function based on Data.Set instead:

Using runhaskell:

 real 0m0.521s user 0m0.031s sys 0m0.015s

Using ghc --make -O2:
```
 real 0m0.183s user 0m0.015s sys 0m0.031s
```

And the moral of today's story is: evaluating expressions in GHCi and timing them with a stopwatch is a very, very bad way to test the performance of your code.
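One way to avoid the stopwatch is to time the work inside a compiled program. The following is only a rough sketch, assuming the question's tree and fillTree; `treeSum` is a hypothetical helper used here to force the whole tree, and `getCPUTime` (from base) reports CPU time in picoseconds. For serious measurements a dedicated benchmarking library such as criterion is the more thorough option.

```haskell
module Main where

import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

data Tree a = EmptyTree
            | Node a (Tree a) (Tree a)

treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x EmptyTree = Node x EmptyTree EmptyTree
treeInsert x (Node a left right) =
  case compare x a of
    EQ -> Node x left right
    LT -> Node a (treeInsert x left) right
    GT -> Node a left (treeInsert x right)

-- Same loop as in the question: inserts x, x + 1, ..., 9999.
fillTree :: Int -> Tree Int -> Tree Int
fillTree 10000 tree = tree
fillTree x tree     = fillTree (x + 1) (treeInsert x tree)

-- Hypothetical helper: visiting every node forces the whole tree.
treeSum :: Tree Int -> Int
treeSum EmptyTree    = 0
treeSum (Node x l r) = treeSum l + x + treeSum r

main :: IO ()
main = do
  start <- getCPUTime
  -- `evaluate` makes sure the sum is computed between the two clock reads.
  total <- evaluate (treeSum (fillTree 1 EmptyTree))
  end   <- getCPUTime
  let millis = fromIntegral (end - start) / (10 ^ (9 :: Int)) :: Double
  putStrLn ("sum = " ++ show total ++ ", CPU time = " ++ show millis ++ " ms")
```

Build it with ghc --make -O2 as above before running it, so that the measurement reflects compiled code instead of the interpreter.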

Answer 2

I doubt you implemented the same code in C. You probably used a non-persistent tree structure instead. ~~That means you're comparing an O(n^2) algorithm in Haskell to an O(n) algorithm in C.~~ Never mind, the specific case you're using is O(n^2) whether or not the structure is persistent. There's just a lot more allocation with the persistent structure, so it's not a fundamental algorithmic difference.
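What "persistent" means here can be sketched in a few lines, assuming the question's tree type: an insert returns a new tree and leaves the old one intact, at the cost of allocating fresh nodes along the path from the root to the insertion point. The `toList` helper and the sample values are only for illustration.

```haskell
module Main where

data Tree a = EmptyTree
            | Node a (Tree a) (Tree a)

-- Each recursive step builds a new Node, so every node on the path from
-- the root to the insertion point is reallocated; untouched subtrees are
-- shared between the old and the new tree.
treeInsert :: Ord a => a -> Tree a -> Tree a
treeInsert x EmptyTree = Node x EmptyTree EmptyTree
treeInsert x (Node a left right) =
  case compare x a of
    EQ -> Node x left right
    LT -> Node a (treeInsert x left) right
    GT -> Node a left (treeInsert x right)

-- In-order traversal, used only to display the contents.
toList :: Tree a -> [a]
toList EmptyTree    = []
toList (Node x l r) = toList l ++ [x] ++ toList r

main :: IO ()
main = do
  let t0 = foldr treeInsert EmptyTree [2, 1, 3 :: Int]
      t1 = treeInsert 4 t0
  print (toList t0)  -- prints [1,2,3]: the old version is unchanged
  print (toList t1)  -- prints [1,2,3,4]
```

A typical in-place C version would allocate a single node per insert and patch one pointer. With sorted input the path is the entire spine, which is where the extra allocation in the persistent version comes from.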

Additionally, it looks like you ran this from ghci. That 'i' in "ghci" means "interpreter". And yes, the interpreter can be tens or hundreds of times slower than compiled code. Try compiling it with optimizations and running it. ~~I suspect it'll still be slower due to fundamental algorithmic differences, but it won't be near 50 seconds.~~

Answer 1: score 14, 2011-07-22 17:35:21

Answer 2: score 6, accepted, 2011-07-22 16:53:42
