Database Structure for hierarchical data with horizontal slices

We're currently trying to improve the performance of queries for our site. The core hierarchical data structure has 5 levels, and each level's type has about 20 fields.

level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)

We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).

The queries we need to do on this are:

fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children

We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known, and are optimised as well as they can be for the current data structure.

We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.

My question is: what's the best way to model this data for efficient read performance?

Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.

I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.

OLAP servers: http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers

Combining relational and OLAP: http://en.wikipedia.org/wiki/HOLAP

Tableau: http://www.tableausoftware.com/

*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.

If I've misunderstood the issue you're having, then by all means please ignore this answer :\

UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).

ODBMS info: http://en.wikipedia.org/wiki/Object_database

InterSystems Caché is one Object Database I know of that sounds like a more appropriate fit based on what you've said.

http://www.intersystems.com/cache/

If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).

In my case, on MS SQL, switching some core queries from a VARCHAR field to an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So, for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.

An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.

The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
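A minimal sketch of the VARCHAR-to-INTEGER idea above, written in Python with the standard-library sqlite3 module rather than MS SQL or MySQL. The table and column names (category, item, category_id) are hypothetical and chosen only for illustration. Whether this actually speeds up a given query is something to measure before and after, as suggested above.

```python
import sqlite3


def create_lookup_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: each distinct text value is stored once in
    # `category`, and `item` refers to it through an indexed INTEGER key.
    conn.executescript(
        """
        CREATE TABLE category (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE item (
            id          INTEGER PRIMARY KEY,
            label       TEXT NOT NULL,
            category_id INTEGER NOT NULL REFERENCES category(id)
        );
        CREATE INDEX idx_item_category ON item(category_id);
        """
    )


def category_id(conn: sqlite3.Connection, name: str) -> int:
    # Insert the text value only if it is new, then return its integer ID.
    conn.execute("INSERT OR IGNORE INTO category (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM category WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def add_item(conn: sqlite3.Connection, label: str, category_name: str) -> int:
    # The item row stores the integer key, never the text itself.
    cur = conn.execute(
        "INSERT INTO item (label, category_id) VALUES (?, ?)",
        (label, category_id(conn, category_name)),
    )
    return cur.lastrowid


def labels_in_category(conn: sqlite3.Connection, category_name: str) -> list:
    # The join compares integers; the text is matched once, in the lookup table.
    rows = conn.execute(
        """
        SELECT item.label
        FROM item
        JOIN category ON category.id = item.category_id
        WHERE category.name = ?
        ORDER BY item.id
        """,
        (category_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Trying this on a single field first keeps the before/after comparison simple.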

And of course, in general you should be sure to have appropriate indexes on your data.

Hope that helps.

A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up and down doesn't count: it is a simple join. Please keep in mind that by going the "document/tree-based database" route you would compromise your general querying ability. To summarize, just hire a competent DB specialist to analyze your performance bottlenecks; they are usually cured by mundane index additions.
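To illustrate the "simple join" point, here is a rough sketch in Python using the standard-library sqlite3 module. It assumes a hypothetical schema in which each level has its own table and a foreign key to its parent; the table names (level3, level4, level5) and columns are invented for the example and are not taken from the question's actual schema.

```python
import sqlite3


def create_hierarchy_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical three-level slice of the hierarchy, one table per level,
    # with an index on each parent key so the joins can use it.
    conn.executescript(
        """
        CREATE TABLE level3 (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE level4 (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            level3_id INTEGER NOT NULL REFERENCES level3(id)
        );
        CREATE TABLE level5 (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            level4_id INTEGER NOT NULL REFERENCES level4(id)
        );
        CREATE INDEX idx_level4_parent ON level4(level3_id);
        CREATE INDEX idx_level5_parent ON level5(level4_id);
        """
    )


def fetch_with_parents(conn: sqlite3.Connection, level5_id: int):
    # One round trip returns the item plus its two parents,
    # instead of one query per level. Returns None if the ID is unknown.
    return conn.execute(
        """
        SELECT l5.name, l4.name, l3.name
        FROM level5 AS l5
        JOIN level4 AS l4 ON l4.id = l5.level4_id
        JOIN level3 AS l3 ON l3.id = l4.level3_id
        WHERE l5.id = ?
        """,
        (level5_id,),
    ).fetchone()
```

The same pattern extends upward to level2 and level1 with two more joins on their primary keys.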

there's not really enough info here to say much useful - you'd need to measure things, look at "explains", etc - but one option that goes beyond the usual indexing would be to shard by level 3 instances. that would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.

the only reason i mention this really is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - i have no idea what tools mysql has for this).

and if your data volume isn't so high then with sharding you might be able to get it down to ssds...
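A rough sketch of what routing by level 3 instance could look like in the application layer, in Python. The function names and the fixed shard count are assumptions made for illustration; nothing here is a MySQL feature. It also assumes every level4 and level5 row can be traced back to its level3 ID, so that children land on the same shard as their parent.

```python
import zlib
from collections import defaultdict


def shard_for(level3_id: int, shard_count: int) -> int:
    # crc32 gives the same result in every process and on every run,
    # so all application servers agree on where a level3 instance lives.
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return zlib.crc32(str(level3_id).encode("ascii")) % shard_count


def group_by_shard(level3_ids, shard_count: int) -> dict:
    # Split a batch of level3 IDs so that each shard can be queried once,
    # and the per-shard queries can run in parallel.
    groups = defaultdict(list)
    for level3_id in level3_ids:
        groups[shard_for(level3_id, shard_count)].append(level3_id)
    return dict(groups)
```

Note that with plain modulo routing, changing the shard count moves most IDs to a different shard, so the count would need to be chosen with some headroom.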
