简体   繁体   中英

Database with “Open Schema” - Good or Bad Idea?

The co-founder of Reddit gave a presentation on issues they had while scaling to millions of users. A summary is available here .

What surprised me is point 3:

Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There's a row for every attribute. There's a row for title, url, author, spam votes, etc. When they add new features they didn't have to worry about the database anymore. They didn't have to add new tables for new things or worry about upgrades.

This seems like a terrible idea to me, but it seems to have worked out for Reddit. Is it a good idea in general, though? Or is it a peculiarity of Reddit that happened to work out for them?

This is a data model known as EAV for entity-attribute-value . It has its uses. A prime example is patient test data which is naturally sparse since there are hundreds of thousands of tests which might be run, but typically only a handful are present for a patient. A table with hundreds of thousands of columns is silly, but a table with EAV makes good sense.

Most of the really big web sites end up using some sort of incredibly simple on the database side of things. This has the advantage that it's fast and scalable. It has the disadvantage that all the relationships that you'd get the database to enforce automatically (via triggers and such) you need to enforce yourself in your client code instead. Maintaining consistency is a pain in the neck, and there's almost always at least some chance that your data will be inconsistent, at least for short periods of time.

For a social networking site, it's a worthwhile compromise. Data that's mostly right most of the time is adequate (eg, who really cares if the number of up-votes you receive for an item is really 20 milliseconds out of date when it's sent), and keeping costs reasonable while scaling to support a gazillion users matters a lot.

I noted that they did not mention anything about the ease or difficulty in creating reports against that data. When used in a narrow set of circumstances, EAVs can be beneficial. As a central part of most systems it will become a nightmare when you hit reporting. The problem with EAVs is that most of the benefit is at the outset of the project and most of the pain is later in the analysis and reporting especially due to the severe lack of data integrity. "Not having to worry about foreign keys" to me sounds like a nightmare of orphaned rows. Add in the use of surrogate keys for everything and you have a tangled morass which generally ends in a complete rewrite

We worked in a similar problem not long ago, i could say at first it wasn't easy and fun, but after some point that you get used to it, it has it's own benefit, it's like developing another Database withing your tables, in some area it's an overkill task, but when you pass these levels, it provide you with lots of functionality, basically after one point, we didn't create any new table, and we just created dynamic forms for everything's, even for our own programming tasks. as for performance, system didn't get million of rows to be fair comparison, but for daily usage i never noticed any differences. some problems i want to share.

  1. we didn't delete any rows we just hide them and setting a flag, and a nightly (weekly) service clean the physical rows
  2. orphan rows, we basically didn't care about cleaning childs, we just set "IsDeleted" On father, and nightly service would clean every rows that is orphan or not needed anymore.

3.you should keep your indexes up to date, but you should skip of building them whenever possible (again nightly service keep index up to date)

  1. we prepared our reporting data ahead of time (AOT) which means we were behind of actual data here :))

we tried hard no to jump in ocean of rows to calc some values by user demand. if we prepared it you can use it, if not then you cant

at the end, there are so many unique challenges to this approach that you should find a way to solve, problems you never faced before in your routine job, but after all of these you earn more flexibility that you can spend.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM