A physical-identity-based alternative to Hashtbl.hash

Question

I'm trying to derive a Graphviz file describing a structured value. This is for diagnostic purposes, so I want my graph to mirror the actual structure in memory as closely as possible. I'm using the code below to map values to Graphviz vertices so that I can reuse a vertex when a value has two or more inbound references:

let same = (==)

module StateIdentity : Hashtbl.HashedType with type t = R.meta_t state = struct
  type t = R.meta_t state
  let hash = Hashtbl.hash
  let equal = same
end

module StateHashtbl = Hashtbl.Make (StateIdentity)

The documentation for Hashtbl.hash suggests that it is suitable for use both when StateIdentity.equal = (=) and when StateIdentity.equal = (==), but I'd like hash table access to be as close to O(1) as possible, so I would rather not have Hashtbl.hash walking a (potentially large, in this case) object graph on every lookup.

I know OCaml moves references around, but is there an O(1) proxy for reference identity available in OCaml?

The answer to "Hashtable of mutable variable in Ocaml" suggests not.

I'm loath to attach serial numbers to states: this is diagnostic code, so any errors I make doing that have the potential to mask other bugs.

Answer 1

If you are using the word "object" in the sense of OCaml's < ... > object types, then you can use Oo.id to get a unique integer identity for each instance. Otherwise the answer to "is there a general proxy for value identity" is "no". In this case my advice would be to start with Hashtbl.hash, evaluate whether it fits your needs, and otherwise design your own hashing function.
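For the object-typed case, a minimal sketch of an identity-keyed table might look like the following. The class `state` and the module names `ObjIdentity` and `ObjTbl` are hypothetical stand-ins, not taken from the question; only `Oo.id` and `Hashtbl.Make` come from the standard library.

```ocaml
(* Hypothetical class standing in for an object-typed state. *)
class state (name : string) = object
  method name = name
end

module ObjIdentity = struct
  type t = state
  (* Two objects are the same key exactly when their ids agree. *)
  let equal a b = Oo.id a = Oo.id b
  (* Oo.id is a constant-time lookup, so hashing does not traverse the object. *)
  let hash o = Oo.id o
end

module ObjTbl = Hashtbl.Make (ObjIdentity)

(* Two distinct instances with identical contents. *)
let a = new state "s"
let b = new state "s"

let vertices : string ObjTbl.t = ObjTbl.create 16

let () =
  ObjTbl.replace vertices a "v0";
  ObjTbl.replace vertices b "v1"
```

Because each instance gets its own id, `a` and `b` end up as separate keys even though they were built from the same argument.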

You can also play with Hashtbl.hash_param (see the documentation) to tune how much of a value is traversed during hashing. Note that the Hashtbl code uses linked lists for buckets of same-hash values, so having lots of hash collisions will trigger linear search behavior. It may be better to move to other implementations that use binary search trees for collision buckets. But then again, you should evaluate your situation before moving to more complex solutions (which also perform worse in the "good case").
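As a rough illustration of the hash_param suggestion, the sketch below bounds the traversal and keeps physical equality for comparison. The record type `node`, the module names, and the bounds 4 and 16 are assumptions chosen for the example; for reference, Hashtbl.hash itself is documented as equivalent to `Hashtbl.hash_param 10 100`, so it is already bounded.

```ocaml
(* Hypothetical mutable record standing in for a state. *)
type node = { label : string; mutable next : node option }

module NodeIdentity = struct
  type t = node
  (* Physical equality distinguishes structurally equal nodes. *)
  let equal = (==)
  (* Inspect at most 4 meaningful nodes and 16 nodes in total. *)
  let hash (n : t) = Hashtbl.hash_param 4 16 n
end

module NodeTbl = Hashtbl.Make (NodeIdentity)

(* Structurally equal but physically distinct keys. *)
let a = { label = "s"; next = None }
let b = { label = "s"; next = None }

let vertices : string NodeTbl.t = NodeTbl.create 16

let () =
  NodeTbl.replace vertices a "v0";
  NodeTbl.replace vertices b "v1"
```

Here `a` and `b` hash to the same value and share a bucket, which is exactly the collision cost described above. Note also that the hash is still structural: mutating a key after it has been inserted can change its hash and make it unfindable.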

Answer 2

I've found it very tricky to use physical equality to do hashing. You certainly can't use something like the address of the value as your hash key, because (as you say) things get moved around by the GC. Once you have a hash key, it seems like you can use physical equality to do comparisons as long as your values are mutable. If your values aren't mutable, OCaml doesn't guarantee much about the meaning of (==). In practical terms, immutable objects that are equal (=) can theoretically be merged into a single physical object if the OCaml compiler or runtime wishes (or vice versa).
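A small sketch of the mutable case, using references as the simplest mutable values (the variable names are illustrative only):

```ocaml
(* Two separately allocated references holding equal contents. *)
let a = ref 0
let b = ref 0

(* Structural equality compares contents. *)
let structurally_equal = (a = b)

(* Physical equality on mutable values asks whether mutating one
   would also affect the other, which is not the case here. *)
let physically_equal = (a == b)

(* A mutable value is always physically equal to itself. *)
let same_cell = (a == a)
```

For immutable values no such example can be relied upon, since the result of (==) there is implementation-dependent.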

When I work through the various possibilities, I usually end up putting a sequence number into my values when I need a unique id. As gasche says, you can use Oo.id if your values are actual OO-style objects.

Answer 3

Like others, I think unique IDs are the way to go.

Unique IDs are not hard to generate safely. One solution is to use a so-called private record, as follows. It prevents users of the module from copying the id field:

module type Intf =
sig
  type t = private {
    id : int;
    foo : string;
  }

  val create_t : foo: string -> t
end

module Impl : Intf =
struct
  type t = {
    id : int;
    foo : string;
  }

  let create_id =
    let n = ref 0 in
    fun () ->
      if !n = -1 then
        failwith "Out of unique IDs"
      else (
        incr n;
        !n
      )

  let create_t ~foo = {
    id = create_id ();
    foo
  }
end
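To connect this back to the question, here is a hedged sketch of how such an id could drive a hash table. It restates a compact version of the private record so that it is self-contained; the names `State`, `StateKey` and `StateTbl` are hypothetical, and the overflow check from the code above is omitted for brevity.

```ocaml
module State : sig
  (* Private record: fields are readable outside, but values can only
     be built through [create]. *)
  type t = private { id : int; foo : string }
  val create : foo:string -> t
end = struct
  type t = { id : int; foo : string }
  let counter = ref 0
  let create ~foo =
    incr counter;
    { id = !counter; foo }
end

module StateKey = struct
  type t = State.t
  (* Identity is decided by the id alone, in constant time. *)
  let equal a b = a.State.id = b.State.id
  let hash s = s.State.id
end

module StateTbl = Hashtbl.Make (StateKey)

(* Same contents, different identities. *)
let s1 = State.create ~foo:"x"
let s2 = State.create ~foo:"x"

let vertices : string StateTbl.t = StateTbl.create 16

let () =
  StateTbl.replace vertices s1 "v0";
  StateTbl.replace vertices s2 "v1"
```

Hashing and comparison never look at `foo`, so lookup cost does not depend on the size of the value.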

Answer 4

Sorry for the ugly hack, but I made something like that some time ago.

The trick is to ensure that values won't be moved in memory after they are inserted in the table. There are two situations that can move values in memory: copying from the minor to the major heap, and major heap compaction. That means that when you insert a value in the table, it must already be in the major heap, and between two operations on the table you must ensure that no compaction happened.

Checking whether the value is in the minor heap can be done using the C function is_young. If it is, you can force the value to migrate to the major heap using Gc.minor ().

For the second problem, you can either completely deactivate compaction or rebuild the table after each compaction. Disabling it can be done using

Gc.set { (Gc.get ()) with Gc.max_overhead = max_int }

Detecting that a compaction happened can be done by comparing, at each access to the table, the number returned by

( Gc.quick_stat () ).Gc.compactions
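One possible way to package that comparison is sketched below. The helper names `compactions`, `last_seen` and `compacted_since_last_check` are made up for this example, and rebuilding the table is left to the caller. How the counter behaves may differ between OCaml runtime versions, so treat this as an illustration of the idea only.

```ocaml
(* Number of heap compactions reported by the runtime so far. *)
let compactions () = (Gc.quick_stat ()).Gc.compactions

(* Counter value observed at the previous check. *)
let last_seen = ref (compactions ())

(* Returns true if the compaction counter changed since the previous
   call, in which case an address-based table should be rebuilt. *)
let compacted_since_last_check () =
  let now = compactions () in
  let changed = now <> !last_seen in
  last_seen := now;
  changed
```

A table wrapper would call `compacted_since_last_check ()` before every lookup or insertion and rehash all entries whenever it returns true.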

Notice that you must disable compaction before accessing the table. If you disable compaction, you should also consider changing the allocation policy to avoid unbounded fragmentation of the heap.

Gc.set {(Gc.get ()) with Gc.allocation_policy = 1}

If you want something really ugly: in old versions of OCaml (before 4.00), compaction kept values in the same order in memory, so you could implement a set or map based on physical address without worrying.


Answer 1: 6 (accepted), 2012-10-24 15:49:56
Answer 2: 5, 2012-10-24 15:59:54
Answer 3: 4, 2012-10-24 17:02:13
Answer 4: 3, 2012-10-24 21:00:43
