
Hierarchical Data Structure Design (Nested Sets)

I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but is relatively static (reloaded once a day), so we are only concerned about READ performance.

The general structure of the catalogue hierarchy is:

  • Level 1 Section
    • Level 2 Section
      • Level 3 Section
        • Level 4 Section (products are linked to here)

We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be:

CREATE TABLE CatalogueSection
(
    SectionID INTEGER,
    ParentID INTEGER,
    LeftExtent INTEGER,
    RightExtent INTEGER
)

CREATE TABLE CatalogueProduct
(
    ProductID INTEGER,
    SectionID INTEGER
)
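For readers unfamiliar with the pattern, here is a minimal sketch of how a nested-set lookup over these two tables typically works. It is not the query from the question (which was not posted): SQLite stands in for SQL Server 2005, the sample rows are invented, and `products_under` is a hypothetical helper name.

```python
import sqlite3

# Assumption: SQLite is used only so the sketch is runnable; the sample
# sections and products below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CatalogueSection (
    SectionID INTEGER, ParentID INTEGER,
    LeftExtent INTEGER, RightExtent INTEGER
);
CREATE TABLE CatalogueProduct (ProductID INTEGER, SectionID INTEGER);

-- Two 4-level branches under one root: 1 > 2 > 3 > 4 and 1 > 5 > 6 > 7.
INSERT INTO CatalogueSection VALUES
    (1, NULL, 1, 14),
    (2, 1, 2, 7),  (3, 2, 3, 6),  (4, 3, 4, 5),
    (5, 1, 8, 13), (6, 5, 9, 12), (7, 6, 10, 11);
-- Products hang off the level-4 sections only.
INSERT INTO CatalogueProduct VALUES (100, 4), (101, 4), (102, 7);
""")

def products_under(section_id):
    """Return products linked to a section or to any of its descendants."""
    rows = conn.execute(
        """
        SELECT p.ProductID
        FROM CatalogueSection AS parent
        JOIN CatalogueSection AS child
          ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
        JOIN CatalogueProduct AS p ON p.SectionID = child.SectionID
        WHERE parent.SectionID = ?
        ORDER BY p.ProductID
        """,
        (section_id,),
    )
    return [row[0] for row in rows]
```

A section's descendants are exactly the rows whose LeftExtent falls inside its own extents, so no recursion is needed. The higher the section, the wider the range and the more product rows the join has to touch.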

We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.

To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is:

CREATE TABLE CatalogueSectionCount
(
    SectionID INTEGER,
    CustomerGroupID INTEGER,
    SubSectionCount INTEGER,
    ProductCount INTEGER
)
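The question does not show how this table is filled, so the following is only a sketch of one possible roll-up. It assumes a hypothetical visibility table `CustomerGroupProduct(CustomerGroupID, ProductID)` and assumes that SubSectionCount means "direct child sections with at least one visible product"; neither is stated in the original. SQLite is used so the example runs, and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CatalogueSection (
    SectionID INTEGER, ParentID INTEGER,
    LeftExtent INTEGER, RightExtent INTEGER
);
CREATE TABLE CatalogueProduct (ProductID INTEGER, SectionID INTEGER);
-- Hypothetical: which customer group may see which product.
CREATE TABLE CustomerGroupProduct (CustomerGroupID INTEGER, ProductID INTEGER);
CREATE TABLE CatalogueSectionCount (
    SectionID INTEGER, CustomerGroupID INTEGER,
    SubSectionCount INTEGER, ProductCount INTEGER
);

INSERT INTO CatalogueSection VALUES
    (1, NULL, 1, 14),
    (2, 1, 2, 7),  (3, 2, 3, 6),  (4, 3, 4, 5),
    (5, 1, 8, 13), (6, 5, 9, 12), (7, 6, 10, 11);
INSERT INTO CatalogueProduct VALUES (100, 4), (101, 4), (102, 7);
-- Group 1 sees everything, group 2 sees product 100 only.
INSERT INTO CustomerGroupProduct VALUES (1, 100), (1, 101), (1, 102), (2, 100);

-- Step 1: roll visible products up to every ancestor section.
-- Sections with no visible product get no row, so they stay hidden.
INSERT INTO CatalogueSectionCount
SELECT parent.SectionID, v.CustomerGroupID, 0, COUNT(DISTINCT p.ProductID)
FROM CatalogueSection AS parent
JOIN CatalogueSection AS child
  ON child.LeftExtent BETWEEN parent.LeftExtent AND parent.RightExtent
JOIN CatalogueProduct AS p ON p.SectionID = child.SectionID
JOIN CustomerGroupProduct AS v ON v.ProductID = p.ProductID
GROUP BY parent.SectionID, v.CustomerGroupID;

-- Step 2: count the direct children that are non-empty for the same group.
UPDATE CatalogueSectionCount
SET SubSectionCount = (
    SELECT COUNT(*)
    FROM CatalogueSection AS c
    JOIN CatalogueSectionCount AS cc ON cc.SectionID = c.SectionID
    WHERE c.ParentID = CatalogueSectionCount.SectionID
      AND cc.CustomerGroupID = CatalogueSectionCount.CustomerGroupID
);
""")

# (CustomerGroupID, SectionID, SubSectionCount, ProductCount) per visible section
counts = conn.execute(
    """
    SELECT CustomerGroupID, SectionID, SubSectionCount, ProductCount
    FROM CatalogueSectionCount
    ORDER BY CustomerGroupID, SectionID
    """
).fetchall()
```

In this sample, group 2 ends up with rows for sections 1 to 4 only, so the branch under section 5 never appears when that group browses.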

So, onto the problem: performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all child sections) takes somewhere in the region of 1 minute to complete. At lower sections in the hierarchy it is faster but still not good enough.

I've put indexes (including covering indexes where applicable) on all key tables and run it through the query analyzer, index tuning wizard, etc., but still cannot get it to perform fast enough.

I'm wondering whether the design is fundamentally flawed or whether it's because we have such a large dataset. We have a reasonable development server (3.8 GHz Xeon, 4 GB RAM) but it's just not working :)

Thanks for any help

James

Use a closure table. If your basic structure is a parent-child with the fields ID and ParentID, then the structure for a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all of its descendants. You may include a LevelsBetween field if you need it. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with LevelsBetween of zero.

Example: Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6

Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0
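One way to derive the ancestor/descendant rows from the parent/child rows above is a recursive walk done once at load time. The sketch below uses SQLite's `WITH RECURSIVE` so it can be run as-is (SQL Server 2005 also supports recursive common table expressions, written without the `RECURSIVE` keyword). The table names `Section` and `SectionClosure` are assumptions, not names from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table names; the rows are the Parent/Child example above.
CREATE TABLE Section (ID INTEGER PRIMARY KEY, ParentID INTEGER);
INSERT INTO Section VALUES (1, NULL), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4);

CREATE TABLE SectionClosure (
    ID INTEGER, DescendantID INTEGER, LevelsBetween INTEGER,
    PRIMARY KEY (ID, DescendantID)
);

-- Start from the self-referencing rows (distance 0), then repeatedly
-- extend each known descendant by one parent-child step.
WITH RECURSIVE walk(ID, DescendantID, LevelsBetween) AS (
    SELECT ID, ID, 0 FROM Section
    UNION ALL
    SELECT w.ID, s.ID, w.LevelsBetween + 1
    FROM walk AS w
    JOIN Section AS s ON s.ParentID = w.DescendantID
)
INSERT INTO SectionClosure
SELECT ID, DescendantID, LevelsBetween FROM walk;
""")

closure = conn.execute(
    "SELECT ID, DescendantID, LevelsBetween FROM SectionClosure "
    "ORDER BY ID, DescendantID"
).fetchall()
```

Every node is paired with itself and with each of its descendants, so a later "everything under section X" lookup becomes a single seek on ID rather than a recursive join.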

The table is intended to eliminate recursive joins. You push the load of the recursive join into an ETL cycle that you run when you load the data once a day. That shifts it away from the query.

Also, it allows variable-level hierarchies. You won't be stuck at 4.

Finally, it allows you to slot products in non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy to create a leaf node to attach products to. You don't need to do that since intermediate nodes are included in the closure.

As far as indexing goes, I would do a clustered index on ID/DescendantID.

Now for your query performance. The closure table takes a chunk out of the cost, but not all of it. You mentioned a "Top 10". This implies ranking over a set of facts that you haven't mentioned. We need details to help tune those. Plus, this only gets the leaf-level sections, not the products. At the very least, you should have an index on your CatalogueProduct that orders by SectionID/ProductID. I would force Section to Product joins to be loop joins based on the cardinality you provided. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct using the index by looped index seeks. Then, with those products, you would get the facts necessary to do the ranking.

You can resolve the customer group issue with roles and treeIds, but you'll have to provide us with the query.
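The access path described above (closure table, then CatalogueProduct, then ranking facts) can be sketched as follows. Since the ranking facts were never posted, `ProductFact` and its `Score` column are purely hypothetical, as are `SectionClosure`, `top_products`, and the sample rows. SQLite is used so the code runs: `WITHOUT ROWID` approximates a clustered index and `LIMIT` replaces `TOP`. SQLite has no join hints; on SQL Server the loop join would be requested with `INNER LOOP JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Closure rows stored in key order, approximating a clustered index
-- on ID/DescendantID.
CREATE TABLE SectionClosure (
    ID INTEGER, DescendantID INTEGER, LevelsBetween INTEGER,
    PRIMARY KEY (ID, DescendantID)
) WITHOUT ROWID;
INSERT INTO SectionClosure VALUES
    (1, 1, 0), (1, 2, 1), (1, 3, 1), (1, 4, 2), (1, 5, 2), (1, 6, 3),
    (2, 2, 0), (3, 3, 0), (3, 4, 1), (3, 5, 1), (3, 6, 2),
    (4, 4, 0), (4, 6, 1), (5, 5, 0), (6, 6, 0);

CREATE TABLE CatalogueProduct (ProductID INTEGER, SectionID INTEGER);
CREATE INDEX IX_CatalogueProduct_Section
    ON CatalogueProduct (SectionID, ProductID);
INSERT INTO CatalogueProduct VALUES (10, 2), (11, 5), (12, 6), (13, 6);

-- Hypothetical ranking facts; the real ones are not in the question.
CREATE TABLE ProductFact (ProductID INTEGER PRIMARY KEY, Score INTEGER);
INSERT INTO ProductFact VALUES (10, 5), (11, 9), (12, 7), (13, 1);
""")

def top_products(section_id, limit=10):
    """Return the best-scoring products in a section and all its descendants."""
    rows = conn.execute(
        """
        SELECT p.ProductID
        FROM SectionClosure AS c                      -- seek on ID
        JOIN CatalogueProduct AS p
          ON p.SectionID = c.DescendantID             -- seek per descendant
        JOIN ProductFact AS f ON f.ProductID = p.ProductID
        WHERE c.ID = ?
        ORDER BY f.Score DESC
        LIMIT ?
        """,
        (section_id, limit),
    )
    return [row[0] for row in rows]
```

Because the self-referencing closure rows are present, products attached directly to the requested section are picked up by the same join as those attached to its descendants.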

Might it be possible to calculate the ProductCount and SubSectionCount after the load each day?
If the data changes only once a day, surely it's worthwhile to calculate these figures then, even if some denormalization is required.
