
Database structure for hierarchical data with horizontal slices

We're currently trying to improve query performance for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.

level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)

We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).

The queries we need to run on this are:

fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children

We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known, and they are optimised as well as they can be for the current data structure.

We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.

My question is: what's the best way to model this data for efficient read performance?

Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.

I currently manage a system like this. We have a standard relational database for input, and then copy the data that matters for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates and then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.

OLAP servers: http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers

Combining relational and OLAP: http://en.wikipedia.org/wiki/HOLAP

Tableau: http://www.tableausoftware.com/

*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.

If I've misunderstood the issue you're having, then by all means please ignore this answer :\

UPDATE: After more discussion, an object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).

ODBMS info: http://en.wikipedia.org/wiki/Object_database

InterSystems Caché is one object database I know of that sounds like a more appropriate fit based on what you've said.

http://www.intersystems.com/cache/

If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).

In my case, on MS SQL, switching some core queries from a VARCHAR field to an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.

An example of high normalization could be using ID numbers for a person's first or last name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where last name and/or first name have their own tables (or one table holding both first and last names), with an ID for each unique name.
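To make the lookup-table idea concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` purely so it runs on its own (your MySQL or MS SQL schema will differ), and the `last_name` and `person` tables and the `last_name_id` helper are hypothetical names made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical lookup table: one row per distinct last name.
    CREATE TABLE last_name (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- The main table stores the integer ID instead of the text.
    CREATE TABLE person (
        id           INTEGER PRIMARY KEY,
        first_name   TEXT NOT NULL,
        last_name_id INTEGER NOT NULL REFERENCES last_name(id)
    );
""")


def last_name_id(conn, name):
    """Return the ID for `name`, inserting a lookup row if it is new."""
    conn.execute("INSERT OR IGNORE INTO last_name (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM last_name WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


for first, last in [("Ann", "Smith"), ("Bob", "Smith"), ("Cy", "Jones")]:
    conn.execute(
        "INSERT INTO person (first_name, last_name_id) VALUES (?, ?)",
        (first, last_name_id(conn, last)),
    )

# The join compares integers; the text is matched once in the lookup table.
smiths = conn.execute(
    """
    SELECT p.first_name
    FROM person AS p
    JOIN last_name AS ln ON ln.id = p.last_name_id
    WHERE ln.name = ?
    ORDER BY p.first_name
    """,
    ("Smith",),
).fetchall()

# "Smith" is stored once even though two people share it.
name_count = conn.execute("SELECT COUNT(*) FROM last_name").fetchone()[0]
```

The trade-off is an extra join whenever you need the text back, so this only pays off where the text column is actually used for joining or filtering.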

The point in your case would be performance rather than de-duplication of data, and something like switching from VARCHAR to INTEGER might bring huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
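A before/after measurement for a single field could look roughly like the sketch below. It again assumes in-memory SQLite as a stand-in, and the `category` and `item` tables, the row counts, and the `timed` helper are all invented for the example. The numbers it prints say nothing about MySQL or MS SQL; the point is only the shape of the comparison, which you would repeat against your real data.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
    CREATE TABLE item (
        id            INTEGER PRIMARY KEY,
        category_code TEXT NOT NULL,     -- "before": join on a text key
        category_id   INTEGER NOT NULL   -- "after": join on an integer key
    );
""")

# Hypothetical test data: 1000 categories, 20000 items.
codes = [f"category-with-a-long-text-key-{i:05d}" for i in range(1000)]
conn.executemany(
    "INSERT INTO category (id, code) VALUES (?, ?)",
    list(enumerate(codes, start=1)),
)
conn.executemany(
    "INSERT INTO item (category_code, category_id) VALUES (?, ?)",
    [(codes[i % 1000], i % 1000 + 1) for i in range(20000)],
)


def timed(sql, repeats=5):
    """Run `sql` several times; return its result and the best wall time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        count = conn.execute(sql).fetchone()[0]
        best = min(best, time.perf_counter() - start)
    return count, best


text_count, text_secs = timed(
    "SELECT COUNT(*) FROM item i JOIN category c ON c.code = i.category_code"
)
int_count, int_secs = timed(
    "SELECT COUNT(*) FROM item i JOIN category c ON c.id = i.category_id"
)
print(f"text join: {text_secs:.4f}s  integer join: {int_secs:.4f}s")
```

Checking that both variants return the same row count before comparing the timings guards against accidentally measuring two different queries.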

And of course, in general you should be sure to have appropriate indexes on your data.
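One way to check whether a query actually uses an index is to look at the query plan before and after creating it. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` so that it is runnable as-is; in MySQL you would look at the output of `EXPLAIN SELECT ...` instead. The `level4` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE level4 ("
    "id INTEGER PRIMARY KEY, level3_id INTEGER NOT NULL, name TEXT)"
)
conn.executemany(
    "INSERT INTO level4 (level3_id, name) VALUES (?, ?)",
    [(i % 100, f"item {i}") for i in range(1000)],
)


def plan(sql):
    """Return SQLite's query-plan detail text for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


query = "SELECT id, name FROM level4 WHERE level3_id = 42"

plan_before = plan(query)  # no index on level3_id yet: a full table scan
conn.execute("CREATE INDEX idx_level4_level3 ON level4 (level3_id)")
plan_after = plan(query)   # the plan should now mention the new index

print("before:", plan_before)
print("after: ", plan_after)
```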

Hope that helps.

A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up and down doesn't count: it is a simple join. Please keep in mind that by going the "document/tree-based database" route you would compromise your general querying ability. To summarize, just hire a competent DB specialist who would analyze your performance bottlenecks; they are usually cured with mundane index addition.
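As a rough illustration of the "simple join" point, the sketch below fetches one item together with its parents, and one parent's children, with plain joins. It is cut down to three levels and uses in-memory SQLite so that it runs on its own; the table layout, column names, and the two helper functions are assumptions, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE level1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE level2 (
        id INTEGER PRIMARY KEY,
        level1_id INTEGER REFERENCES level1(id),
        name TEXT
    );
    CREATE TABLE level3 (
        id INTEGER PRIMARY KEY,
        level2_id INTEGER REFERENCES level2(id),
        name TEXT
    );
    INSERT INTO level1 VALUES (1, 'root');
    INSERT INTO level2 VALUES (10, 1, 'branch');
    INSERT INTO level3 VALUES (100, 10, 'leaf-a'), (101, 10, 'leaf-b');
""")


def fetch_with_parents(conn, level3_id):
    """Fetch one level3 row plus its level2 and level1 parents in one query."""
    return conn.execute(
        """
        SELECT l3.name, l2.name, l1.name
        FROM level3 AS l3
        JOIN level2 AS l2 ON l2.id = l3.level2_id
        JOIN level1 AS l1 ON l1.id = l2.level1_id
        WHERE l3.id = ?
        """,
        (level3_id,),
    ).fetchone()


def fetch_children(conn, level2_id):
    """Fetch the names of all level3 children of one level2 row."""
    rows = conn.execute(
        "SELECT name FROM level3 WHERE level2_id = ? ORDER BY id",
        (level2_id,),
    ).fetchall()
    return [name for (name,) in rows]


item_and_parents = fetch_with_parents(conn, 101)
children = fetch_children(conn, 10)
```

Because the depth is fixed at five levels, the same pattern extends with one more join per level, with no recursive query needed.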

There's not really enough info here to say much that's useful (you'd need to measure things, look at "explains", etc.), but one option that goes beyond the usual indexing would be to shard by level 3 instances. At its simplest (separate disks), that would give you better performance on parallel queries that hit different shards, or you could use separate machines if you want to throw more resources at each shard.

The only reason I mention this really is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted; I have no idea what tools MySQL has for this).
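For what it's worth, application-layer routing by level 3 ID can be very small. The sketch below is one possible scheme, assuming integer level 3 IDs and a fixed list of shards; the shard names and both functions are hypothetical, and a real setup would map each name to a database connection.

```python
from collections import defaultdict

# Hypothetical shard names; each would map to a separate disk or machine.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(level3_id: int) -> str:
    """Pick the shard that holds a level3 item and all of its children."""
    return SHARDS[level3_id % len(SHARDS)]


def group_by_shard(level3_ids):
    """Group requested level3 IDs so each shard is queried only once."""
    groups = defaultdict(list)
    for level3_id in level3_ids:
        groups[shard_for(level3_id)].append(level3_id)
    return dict(groups)


# A multi-item fetch becomes one query per shard, which can run in parallel.
routing = group_by_shard([3, 4, 8, 11, 13])
print(routing)
```

Plain modulo routing is the simplest choice, but changing the number of shards later means moving most of the data, so treat that part as a placeholder.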

And if your data volume isn't so high, then with sharding you might be able to get it down to SSDs...
