
What are the options for storing hierarchical data in a relational database?

Good Overviews

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fits your needs. The following provides some in-depth reading:

Options

Ones I am aware of, with their general features:

  1. Adjacency List:
  • Columns: ID, ParentID
  • Easy to implement.
  • Cheap node moves, inserts, and deletes.
  • Expensive to find the level, ancestry & descendants, and path
  • Avoid N+1 queries by using recursive Common Table Expressions in databases that support them
  2. Nested Set (aka Modified Preorder Tree Traversal)
  • Columns: Left, Right
  • Cheap ancestry, descendants
  • Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
  3. Bridge Table (aka Closure Table, with triggers)
  • Uses a separate join table with ancestor, descendant, depth (optional)
  • Cheap ancestry and descendants
  • Write costs O(log n) (size of the subtree) for inserts, updates, deletes
  • Normalized encoding: good for RDBMS statistics & query planner in joins
  • Requires multiple rows per node
  4. Lineage Column (aka Materialized Path, Path Enumeration)
  • Column: lineage (e.g. /parent/child/grandchild/etc...)
  • Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
  • Write costs O(log n) (size of the subtree) for inserts, updates, deletes
  • Non-relational: relies on an Array datatype or a serialized string format
  5. Nested Intervals
  • Like nested set, but with real/float/decimal values so that the encoding isn't volatile (inexpensive move/insert/delete)
  • Has real/float/decimal representation/precision issues
  • Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
  6. Flat Table
  • A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
  • Cheap to iterate/paginate over
  • Expensive move and delete
  • Good Use: threaded discussion - forums / blog comments
  7. Multiple lineage columns
  • Columns: one for each lineage level, referring to all the parents up to the root; levels below the item's own level are set to NULL
  • Cheap ancestors, descendants, level
  • Cheap insert, delete, move of the leaves
  • Expensive insert, delete, move of the internal nodes
  • Hard limit to how deep the hierarchy can be
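
As a rough illustration of the Adjacency List option and the recursive CTE mentioned under it, here is a minimal sketch using Python's built-in sqlite3 module. The table name `node`, its columns, and the sample rows are assumptions made up for this example; the same `WITH RECURSIVE` query shape should carry over to other databases that support recursive CTEs, though syntax details vary.

```python
import sqlite3

# Hypothetical adjacency-list table: each row stores only its parent's id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        name      TEXT NOT NULL
    );
    INSERT INTO node VALUES
        (1, NULL, 'root'),
        (2, 1,    'a'),
        (3, 1,    'b'),
        (4, 2,    'a1'),
        (5, 4,    'a1x');
""")


def descendants(conn, node_id):
    """Return (id, name, level) for every descendant in one round trip."""
    return conn.execute(
        """
        WITH RECURSIVE sub(id, name, level) AS (
            SELECT id, name, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.name, sub.level + 1
            FROM node AS n
            JOIN sub ON n.parent_id = sub.id
        )
        SELECT id, name, level FROM sub WHERE level > 0 ORDER BY level, id
        """,
        (node_id,),
    ).fetchall()
```

The single recursive query replaces the one-query-per-level loop that causes the N+1 pattern.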

Database Specific Notes

MySQL

Oracle

PostgreSQL

SQL Server

  • General summary
  • 2008 offers the HierarchyId data type, which appears to help with the Lineage Column approach and expand the depth that can be represented.

My favorite answer is what the first sentence in this thread suggested: use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

The problem up until now has been that the conversion method from an Adjacency List to Nested Sets has been frightfully slow, because most people use the extreme RBAR ("Row By Agonizing Row") method known as a "Push Stack" to do the conversion. It has been considered way too expensive to reach the Nirvana of simple maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other, especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.
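
To make concrete what the conversion has to produce, here is a minimal in-memory sketch in Python. It is not the push stack SQL procedure, and it is not the faster method this answer goes on to describe; it is only a plain depth-first walk that assigns left and right bounds. The function name `to_nested_sets` and the input shape (a dict mapping each node id to its parent id, with `None` for roots) are assumptions for illustration.

```python
from collections import defaultdict


def to_nested_sets(parent_of):
    """Map {node: parent} (parent None for a root) to {node: (lft, rgt)}."""
    children = defaultdict(list)
    roots = []
    for node, parent in sorted(parent_of.items()):
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    left = {}
    bounds = {}
    counter = 0
    for root in roots:
        counter += 1
        left[root] = counter
        # Explicit stack of (node, iterator over its children) avoids recursion limits.
        stack = [(root, iter(children[root]))]
        while stack:
            node, pending = stack[-1]
            child = next(pending, None)
            if child is None:
                # All children visited: close this node with its right bound.
                stack.pop()
                counter += 1
                bounds[node] = (left[node], counter)
            else:
                counter += 1
                left[child] = counter
                stack.append((child, iter(children[child])))
    return bounds
```

Each node is pushed and popped once, so the walk itself is linear in the number of nodes; the cost in a database comes from doing this row by row in procedural SQL.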

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested Sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parentheses).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

Adjacency Model + Nested Sets Model

I went for it because I could insert new items into the tree easily (you just need a branch's id to insert a new item under it) and also query it quite fast.

+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
  • Every time you need all children of any parent, you just query the parent column.
  • If you need all descendants of any parent, you query for items whose lft is between the lft and rgt of the parent.
  • If you need all parents of any node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt, and sort them by lft, which lists the ancestors from the root down.
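
The three lookups above can be sketched against the sample table with Python's sqlite3 module. The table name `category` and the helper function names are assumptions; the rows are copied from the table shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " category_id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
    " parent INTEGER, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", None, 1, 20),
        (2, "TELEVISIONS", 1, 2, 9),
        (3, "TUBE", 2, 3, 4),
        (4, "LCD", 2, 5, 6),
        (5, "PLASMA", 2, 7, 8),
        (6, "PORTABLE ELECTRONICS", 1, 10, 19),
        (7, "MP3 PLAYERS", 6, 11, 14),
        (8, "FLASH", 7, 12, 13),
        (9, "CD PLAYERS", 6, 15, 16),
        (10, "2 WAY RADIOS", 6, 17, 18),
    ],
)


def children(conn, category_id):
    """Direct children: uses only the adjacency (parent) column."""
    rows = conn.execute(
        "SELECT name FROM category WHERE parent = ? ORDER BY lft", (category_id,)
    )
    return [name for (name,) in rows]


def descendants(conn, category_id):
    """Whole subtree: rows whose lft falls strictly inside the node's interval."""
    rows = conn.execute(
        "SELECT c.name FROM category AS c"
        " JOIN category AS p ON c.lft > p.lft AND c.lft < p.rgt"
        " WHERE p.category_id = ? ORDER BY c.lft",
        (category_id,),
    )
    return [name for (name,) in rows]


def ancestors(conn, category_id):
    """Path to the root: rows whose interval encloses the node's interval."""
    rows = conn.execute(
        "SELECT a.name FROM category AS a"
        " JOIN category AS n ON a.lft < n.lft AND a.rgt > n.rgt"
        " WHERE n.category_id = ? ORDER BY a.lft",
        (category_id,),
    )
    return [name for (name,) in rows]
```

Ordering by lft returns descendants in tree order and ancestors root first, independent of how the ids were assigned.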

I needed accessing and querying the tree to be faster than inserts; that's why I chose this.

The only problem is fixing the left and right columns when inserting new items. I created a stored procedure for that and called it every time I inserted a new item, which was rare in my case, but it is really fast. I got the idea from Joe Celko's book; the stored procedure and how I came up with it are explained on DBA SE: https://dba.stackexchange.com/q/89051/41481
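
For readers who want a feel for what such a fix-up involves, here is a minimal sketch of one common way to insert a new last child into a nested set: open a gap of two at the parent's right bound, then place the new row in it. This is an assumption-laden illustration in Python with sqlite3, not the stored procedure from the linked DBA SE post; the function name `add_child` and the three starting rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        parent      INTEGER,
        lft         INTEGER NOT NULL,
        rgt         INTEGER NOT NULL
    );
    INSERT INTO category VALUES
        (1, 'ELECTRONICS',          NULL, 1, 6),
        (2, 'TELEVISIONS',          1,    2, 3),
        (3, 'PORTABLE ELECTRONICS', 1,    4, 5);
""")


def add_child(conn, parent_id, name):
    """Insert `name` as the last child of `parent_id`, shifting bounds to make room."""
    with conn:  # one transaction: either all bounds move or none do
        (parent_rgt,) = conn.execute(
            "SELECT rgt FROM category WHERE category_id = ?", (parent_id,)
        ).fetchone()
        # Every interval that ends at or after the parent's right bound grows by 2.
        conn.execute("UPDATE category SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,))
        # Every interval that starts after it shifts right by 2.
        conn.execute("UPDATE category SET lft = lft + 2 WHERE lft > ?", (parent_rgt,))
        cur = conn.execute(
            "INSERT INTO category (name, parent, lft, rgt) VALUES (?, ?, ?, ?)",
            (name, parent_id, parent_rgt, parent_rgt + 1),
        )
        return cur.lastrowid


new_id = add_child(conn, 2, "LCD")
```

The two UPDATE statements touch every row to the right of the insertion point, which is why inserts are the expensive side of this design.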

This design was not mentioned yet:

Multiple lineage columns

Though it has limitations, if you can bear them, it's very simple and very efficient. Features:

  • Columns: one for each lineage level, referring to all the parents up to the root; levels below the current item's level are set to 0 (or NULL)
  • There is a fixed limit to how deep the hierarchy can be
  • Cheap ancestors, descendants, level
  • Cheap insert, delete, move of the leaves
  • Expensive insert, delete, move of the internal nodes

Here follows an example: a taxonomic tree of birds, so the hierarchy is Class/Order/Family/Genus/Species. Species is the lowest level, and 1 row = 1 taxon (which corresponds to a species in the case of the leaf nodes):

CREATE TABLE `taxons` (
  `TaxonId` smallint(6) NOT NULL default '0',
  `ClassId` smallint(6) default NULL,
  `OrderId` smallint(6) default NULL,
  `FamilyId` smallint(6) default NULL,
  `GenusId` smallint(6) default NULL,
  `Name` varchar(150) NOT NULL default ''
);

and the example of the data:

+---------+---------+---------+----------+---------+-------------------------------+
| TaxonId | ClassId | OrderId | FamilyId | GenusId | Name                          |
+---------+---------+---------+----------+---------+-------------------------------+
|     254 |       0 |       0 |        0 |       0 | Aves                          |
|     255 |     254 |       0 |        0 |       0 | Gaviiformes                   |
|     256 |     254 |     255 |        0 |       0 | Gaviidae                      |
|     257 |     254 |     255 |      256 |       0 | Gavia                         |
|     258 |     254 |     255 |      256 |     257 | Gavia stellata                |
|     259 |     254 |     255 |      256 |     257 | Gavia arctica                 |
|     260 |     254 |     255 |      256 |     257 | Gavia immer                   |
|     261 |     254 |     255 |      256 |     257 | Gavia adamsii                 |
|     262 |     254 |       0 |        0 |       0 | Podicipediformes              |
|     263 |     254 |     262 |        0 |       0 | Podicipedidae                 |
|     264 |     254 |     262 |      263 |       0 | Tachybaptus                   |
+---------+---------+---------+----------+---------+-------------------------------+

This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.
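
As a sketch of how those operations might look, the following Python/sqlite3 example loads the sample rows above (using 0 for "no parent at this level", as in the data) and derives level, ancestors, and descendants. The helper names and the `LINEAGE_COLUMNS` list are assumptions for illustration; the original answer does not show its queries.

```python
import sqlite3

# Lineage columns ordered from the root level down; fixed by the schema.
LINEAGE_COLUMNS = ["ClassId", "OrderId", "FamilyId", "GenusId"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxons (TaxonId INTEGER PRIMARY KEY, ClassId INTEGER,"
    " OrderId INTEGER, FamilyId INTEGER, GenusId INTEGER, Name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO taxons VALUES (?, ?, ?, ?, ?, ?)",
    [
        (254, 0, 0, 0, 0, "Aves"),
        (255, 254, 0, 0, 0, "Gaviiformes"),
        (256, 254, 255, 0, 0, "Gaviidae"),
        (257, 254, 255, 256, 0, "Gavia"),
        (258, 254, 255, 256, 257, "Gavia stellata"),
        (259, 254, 255, 256, 257, "Gavia arctica"),
        (260, 254, 255, 256, 257, "Gavia immer"),
        (261, 254, 255, 256, 257, "Gavia adamsii"),
        (262, 254, 0, 0, 0, "Podicipediformes"),
        (263, 254, 262, 0, 0, "Podicipedidae"),
        (264, 254, 262, 263, 0, "Tachybaptus"),
    ],
)


def level(conn, taxon_id):
    """Depth of a taxon = number of filled lineage columns (0 for the class)."""
    row = conn.execute(
        "SELECT ClassId, OrderId, FamilyId, GenusId FROM taxons WHERE TaxonId = ?",
        (taxon_id,),
    ).fetchone()
    return sum(1 for value in row if value)


def ancestors(conn, taxon_id):
    """Ancestor names, root first, read straight from the taxon's own row."""
    rows = conn.execute(
        "SELECT p.Name FROM taxons AS t"
        " JOIN taxons AS p"
        "   ON p.TaxonId IN (t.ClassId, t.OrderId, t.FamilyId, t.GenusId)"
        " WHERE t.TaxonId = ?"
        " ORDER BY (p.ClassId <> 0) + (p.OrderId <> 0)"
        "        + (p.FamilyId <> 0) + (p.GenusId <> 0)",
        (taxon_id,),
    )
    return [name for (name,) in rows]


def descendants(conn, taxon_id):
    """All taxa below this one: a single equality test on the right column."""
    depth = level(conn, taxon_id)
    if depth >= len(LINEAGE_COLUMNS):
        return []  # species rows are leaves in this schema
    column = LINEAGE_COLUMNS[depth]  # taken from the fixed list above, not user input
    rows = conn.execute(
        f"SELECT Name FROM taxons WHERE {column} = ? ORDER BY TaxonId", (taxon_id,)
    )
    return [name for (name,) in rows]
```

Each lookup is a plain indexed comparison, which is where the "cheap ancestors, descendants, level" claim comes from; the price is that moving an internal node means rewriting the lineage columns of its whole subtree.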

This is a very partial answer to your question, but I hope it is still useful.

Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:

Have a look at "Model Your Data Hierarchies With SQL Server 2008" by Kent Tegels on MSDN for a start. See also my own question: Recursive same-table query in SQL Server 2008

If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.

Specifically with Postgres, you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.
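
The following is only a stand-in sketch of the materialized path idea, written in Python with sqlite3. SQLite has no array type, so the path is stored as a delimited string and queried with a prefix match instead of the Postgres array operators and GIN index described above. The table `page`, its rows, and the helper names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- path holds every ancestor id plus the row's own id, e.g. '/1/2/3/'
    CREATE TABLE page (id INTEGER PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL);
    INSERT INTO page VALUES
        (1, 'root', '/1/'),
        (2, 'docs', '/1/2/'),
        (3, 'api',  '/1/2/3/'),
        (4, 'blog', '/1/4/');
""")


def _path(conn, node_id):
    (path,) = conn.execute("SELECT path FROM page WHERE id = ?", (node_id,)).fetchone()
    return path


def descendants(conn, node_id):
    """Rows whose path starts with this node's path (prefix match)."""
    rows = conn.execute(
        "SELECT name FROM page WHERE path LIKE ? AND id <> ? ORDER BY path",
        (_path(conn, node_id) + "%", node_id),
    )
    return [name for (name,) in rows]


def ancestor_ids(conn, node_id):
    """Ancestors are encoded in the row itself: no join needed."""
    ids = [int(part) for part in _path(conn, node_id).strip("/").split("/")]
    return ids[:-1]  # drop the node's own id


def depth(conn, node_id):
    return len(ancestor_ids(conn, node_id))
```

With a real array column the parsing step disappears, since the ancestor ids are already individual elements.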

I have a full write-up of using arrays for materialized paths if you're curious.

This is really a square peg, round hole question.

If relational databases and SQL are the only hammer you have or are willing to use, then the answers that have been posted thus far are adequate. However, why not use a tool designed to handle hierarchical data? Graph databases are ideal for complex hierarchical data.

The inefficiencies of the relational model, along with the complexities of any code/query solution to map a graph/hierarchical model onto a relational model, are just not worth the effort when compared to the ease with which a graph database solution can solve the same problem.

Consider a Bill of Materials as a common hierarchical data structure.

class Component extends Vertex {
    long assetId;
    long partNumber;
    long material;
    long amount;
};

class PartOf extends Edge {
};

class AdjacentTo extends Edge {
};

Shortest path between two sub-assemblies: Simple graph traversal algorithm. Acceptable paths can be qualified based on criteria.

Similarity: What is the degree of similarity between two assemblies? Perform a traversal on both sub-trees, computing the intersection and union of the two sub-trees. The percent similar is the intersection divided by the union.

Transitive Closure: Walk the sub-tree and sum up the field(s) of interest, e.g. "How much aluminum is in a sub-assembly?"
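
To show the shape of the similarity and roll-up traversals without committing to any particular graph database API, here is a minimal in-memory sketch in Python. Everything in it is an assumption for illustration: the sample components, the use of a string for `material` (the class above declares it as a number), and the choice to compare assemblies by the set of part numbers they contain.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    asset_id: int
    part_number: int
    material: str
    amount: float


# Hypothetical bill of materials: two assemblies that share one bolt type.
components = {
    c.asset_id: c
    for c in [
        Component(1, 10, "aluminum", 2.0),  # frame assembly
        Component(2, 11, "aluminum", 1.5),  # tube
        Component(3, 12, "steel", 0.1),     # bolt
        Component(4, 20, "aluminum", 0.8),  # wheel assembly
        Component(5, 12, "steel", 0.1),     # bolt
        Component(6, 21, "rubber", 0.6),    # tire
    ]
}

# PartOf edges as (child asset, parent asset).
part_of = [(2, 1), (3, 1), (5, 4), (6, 4)]
children = defaultdict(list)
for child, parent in part_of:
    children[parent].append(child)


def subtree(root):
    """Asset ids reachable from root by following PartOf edges downward."""
    seen, stack = set(), [root]
    while stack:
        asset = stack.pop()
        if asset not in seen:
            seen.add(asset)
            stack.extend(children[asset])
    return seen


def total_material(root, material):
    """Roll-up: sum one field over the whole sub-assembly."""
    return sum(
        components[a].amount for a in subtree(root) if components[a].material == material
    )


def similarity(root_a, root_b):
    """Intersection over union of the part numbers used in each sub-assembly."""
    parts_a = {components[a].part_number for a in subtree(root_a)}
    parts_b = {components[a].part_number for a in subtree(root_b)}
    return len(parts_a & parts_b) / len(parts_a | parts_b)
```

A graph database would express the same walks as traversal queries over vertices and edges rather than as application code.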

Yes, you can solve the problem with SQL and a relational database. However, there are much better approaches if you are willing to use the right tool for the job.

I am using PostgreSQL with closure tables for my hierarchies. I have one universal stored procedure for the whole database:

CREATE FUNCTION nomen_tree() RETURNS trigger
    LANGUAGE plpgsql
    AS $_$
DECLARE
  old_parent INTEGER;
  new_parent INTEGER;
  id_nom INTEGER;
  txt_name TEXT;
BEGIN
-- TG_ARGV[0] = name of table with entities with PARENT-CHILD relationships (TBL_ORIG)
-- TG_ARGV[1] = name of helper table with ANCESTOR, CHILD, DEPTH information (TBL_TREE)
-- TG_ARGV[2] = name of the field in TBL_ORIG which is used for the PARENT-CHILD relationship (FLD_PARENT)
    IF TG_OP = 'INSERT' THEN
    EXECUTE 'INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT $1.id,$1.id,0 UNION ALL
      SELECT $1.id,ancestor_id,depth+1 FROM ' || TG_ARGV[1] || ' WHERE child_id=$1.' || TG_ARGV[2] USING NEW;
    ELSE                                                           
    -- EXECUTE does not support conditional statements inside
    EXECUTE 'SELECT $1.' || TG_ARGV[2] || ',$2.' || TG_ARGV[2] INTO old_parent,new_parent USING OLD,NEW;
    IF COALESCE(old_parent,0) <> COALESCE(new_parent,0) THEN
      EXECUTE '
      -- prevent cycles in the tree
      UPDATE ' || TG_ARGV[0] || ' SET ' || TG_ARGV[2] || ' = $1.' || TG_ARGV[2]
        || ' WHERE id=$2.' || TG_ARGV[2] || ' AND EXISTS(SELECT 1 FROM '
        || TG_ARGV[1] || ' WHERE child_id=$2.' || TG_ARGV[2] || ' AND ancestor_id=$2.id);
      -- first remove edges between all old parents of node and its descendants
      DELETE FROM ' || TG_ARGV[1] || ' WHERE child_id IN
        (SELECT child_id FROM ' || TG_ARGV[1] || ' WHERE ancestor_id = $1.id)
        AND ancestor_id IN
        (SELECT ancestor_id FROM ' || TG_ARGV[1] || ' WHERE child_id = $1.id AND ancestor_id <> $1.id);
      -- then add edges for all new parents ...
      INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT child_id,ancestor_id,d_c+d_a FROM
        (SELECT child_id,depth AS d_c FROM ' || TG_ARGV[1] || ' WHERE ancestor_id=$2.id) AS child
        CROSS JOIN
        (SELECT ancestor_id,depth+1 AS d_a FROM ' || TG_ARGV[1] || ' WHERE child_id=$2.' 
        || TG_ARGV[2] || ') AS parent;' USING OLD, NEW;
    END IF;
  END IF;
  RETURN NULL;
END;
$_$;

Then, for each table where I have a hierarchy, I create a trigger:

CREATE TRIGGER nomenclature_tree_tr AFTER INSERT OR UPDATE ON nomenclature FOR EACH ROW EXECUTE PROCEDURE nomen_tree('my_db.nomenclature', 'my_db.nom_helper', 'parent_id');

For populating a closure table from an existing hierarchy I use this stored procedure:

CREATE FUNCTION rebuild_tree(tbl_base text, tbl_closure text, fld_parent text) RETURNS void
    LANGUAGE plpgsql
    AS $$
BEGIN
    EXECUTE 'TRUNCATE ' || tbl_closure || ';
    INSERT INTO ' || tbl_closure || ' (child_id,ancestor_id,depth) 
        WITH RECURSIVE tree AS
      (
        SELECT id AS child_id,id AS ancestor_id,0 AS depth FROM ' || tbl_base || '
        UNION ALL 
        SELECT t.id,ancestor_id,depth+1 FROM ' || tbl_base || ' AS t
        JOIN tree ON child_id = ' || fld_parent || '
      )
      SELECT * FROM tree;';
END;
$$;

Closure tables are defined with 3 columns: ANCESTOR_ID, DESCENDANT_ID, DEPTH (the functions above name the descendant column child_id). It is possible (and I even advise it) to store records with the same value for ANCESTOR and DESCENDANT, and a value of zero for DEPTH. This will simplify the queries for retrieval of the hierarchy. And they are very simple indeed:

-- get all descendants
SELECT tbl_orig.*,depth FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth <> 0;
-- get only direct descendants
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth = 1;
-- get all ancestors
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON ancestor_id = tbl_orig.id WHERE descendant_id = XXX AND depth <> 0;
-- find the deepest level of children
SELECT MAX(depth) FROM tbl_closure WHERE ancestor_id = XXX;
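
For a self-contained way to try the closure table idea, here is a minimal sketch in Python with sqlite3. It mirrors the INSERT branch of the trigger (a self row at depth 0 plus one row per ancestor of the parent) and the retrieval queries above, but it is not a port of the PL/pgSQL code: the table names `item` and `item_closure`, the helper functions, and the sample rows are assumptions for illustration, and moving a node is not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE item_closure (
        ancestor_id   INTEGER NOT NULL REFERENCES item(id),
        descendant_id INTEGER NOT NULL REFERENCES item(id),
        depth         INTEGER NOT NULL,
        PRIMARY KEY (ancestor_id, descendant_id)
    );
""")


def add_item(conn, item_id, name, parent_id=None):
    """Insert a node and its closure rows: itself at depth 0, then every ancestor."""
    with conn:
        conn.execute("INSERT INTO item VALUES (?, ?)", (item_id, name))
        conn.execute(
            """
            INSERT INTO item_closure (ancestor_id, descendant_id, depth)
            SELECT ?, ?, 0
            UNION ALL
            SELECT ancestor_id, ?, depth + 1
            FROM item_closure
            WHERE descendant_id = ?  -- matches nothing when parent_id is NULL
            """,
            (item_id, item_id, item_id, parent_id),
        )


def descendants(conn, item_id):
    rows = conn.execute(
        "SELECT i.name, c.depth FROM item_closure AS c"
        " JOIN item AS i ON i.id = c.descendant_id"
        " WHERE c.ancestor_id = ? AND c.depth > 0 ORDER BY c.depth, i.id",
        (item_id,),
    )
    return rows.fetchall()


def ancestors(conn, item_id):
    rows = conn.execute(
        "SELECT i.name FROM item_closure AS c"
        " JOIN item AS i ON i.id = c.ancestor_id"
        " WHERE c.descendant_id = ? AND c.depth > 0 ORDER BY c.depth",
        (item_id,),
    )
    return [name for (name,) in rows]


add_item(conn, 1, "root")
add_item(conn, 2, "a", parent_id=1)
add_item(conn, 3, "b", parent_id=2)
add_item(conn, 4, "c", parent_id=1)
```

Storing the depth-0 self rows is what lets the insert copy the parent's ancestor list in one statement, which is the simplification the answer recommends.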
