Which DB should I use?

I am building an application that should store and handle large amounts of data, so I'm struggling with the question of which DB to use.

My requirements are:

  • Handle up to ~100,000 insert commands per second, possibly issued concurrently from several threads. 100,000 is the peak; most of the time the rate would be between a few hundred and a few thousand.
  • Store millions of records.
  • Query the data as quickly as possible.
  • Some of the properties vary from entity to entity, which fits a non-relational model better than a relational one. However, the set of possible properties is not huge, so it could be represented as columns in a relational database (if that is much faster); see the sketch after this list.
  • Update commands will rarely occur.
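
A minimal sketch of that hybrid modelling, assuming SQLite and hypothetical table/column names: the fields shared by every entity live in ordinary columns, while the properties that vary per entity go into a JSON blob column.

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE entities (
            id     INTEGER PRIMARY KEY,
            name   TEXT,   -- property shared by every entity
            extras TEXT    -- JSON blob for the properties that vary
        )
    """)
    conn.execute(
        "INSERT INTO entities (name, extras) VALUES (?, ?)",
        ("sensor-1", json.dumps({"firmware": "2.1", "battery": 87})),
    )
    name, extras = conn.execute("SELECT name, extras FROM entities").fetchone()
    print(name, json.loads(extras))  # sensor-1 {'firmware': '2.1', 'battery': 87}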

Which DB would you recommend I use?

Thanks!

Update: The OS I'm using isn't Windows. I thought that if SQL Server were the most strongly recommended DB I might switch, but judging from your responses, that is not the case.

Regarding the budget: I will start with the cheapest option, and I expect that to change once the company has more money and more users.

No one has recommended a NoSQL database. Are they really that bad for these kinds of requirements?

The answer depends on additional questions, such as how much you want to spend, what OS you are using, and what expertise you have in-house.

Databases that I know can handle such a massive scale include DB2, Oracle, Teradata, and SQL Server. MySQL may also be an option, though I'm not sure of its performance capabilities.

There are others, I'm sure, designed for handling data on the massive scale you are suggesting, and you may need to look into those, as well.

So, if your OS is not Windows, you can exclude SQL Server.

If you are going the cheap route, MySQL may be your best option.

DB2 and Oracle are both mature database systems. If your system is a mainframe (IBM System/370), I'd recommend DB2; for Unix-based systems, either may be an option.

I don't know much about Teradata, but I know it is specifically designed for massive amounts of data, so it may be closer to what you are looking for.

A more complete list of choices can be found here: http://en.wikipedia.org/wiki/List_of_relational_database_management_systems

A decent comparison of databases is here: http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems

100,000+ inserts a second is a huge number. No matter what you choose, you are looking at spending a fortune on hardware to handle it.

This is not a question about which DB to choose; it is a question about your skills and experience.

If you think this is possible with one physical machine, you are on the wrong track. If you know that several machines will be needed, then why ask about the DB? The DB itself is not as important as the way you work with it.

Start with a write-only DB on one server and scale it vertically for now. Use several read-only servers and scale them horizontally (here a document database is almost always a safe choice). The CQRS pattern will answer the questions you are about to run into.
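
A toy sketch of that command/query split, using a single SQLite file to stand in for the write master and its read replicas (names are hypothetical; in production the writer and readers would be separate servers):

    import itertools
    import sqlite3

    class CqrsRouter:
        """Route all writes to one connection, spread reads over a pool."""
        def __init__(self, db_path, n_readers=2):
            self.writer = sqlite3.connect(db_path)
            readers = [sqlite3.connect(db_path) for _ in range(n_readers)]
            self._readers = itertools.cycle(readers)

        def command(self, sql, params=()):
            self.writer.execute(sql, params)  # mutations: single writer only
            self.writer.commit()

        def query(self, sql, params=()):
            return next(self._readers).execute(sql, params).fetchall()

    router = CqrsRouter("events.db")
    router.command("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")
    router.command("INSERT INTO events VALUES (?, ?)", (1, "hello"))
    print(router.query("SELECT * FROM events"))  # [(1, 'hello')]

The point is that the write path and the read path are separate objects with separate scaling stories, which is the essence of CQRS.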

"Handle up to ~100,000 insert commands a second" - is this peak, or normal operation? If normal operation, your 'millions of records stored' is likely to be billions...

With questions like this, I think it is useful to understand the business problem further, as these are non-trivial requirements! The question is whether the problem justifies this brute-force approach, or if there are alternative ways of looking at it that achieve the same goal.

If it is needed, you can then consider whether there are ways of aggregating or transforming the data to make this volume easier to manage: bulk loading, discarding all but the last of multiple updates to the same record, or loading into multiple databases and then aggregating downstream as a combined set of ETLs, perhaps.
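
A small sketch of the "discard multiple updates to the same record" idea, assuming a hypothetical stream of (record_id, payload) events and SQLite as the target:

    import sqlite3

    def coalesce(events):
        """Keep only the last payload seen for each record id."""
        latest = {}
        for record_id, payload in events:
            latest[record_id] = payload  # later events overwrite earlier ones
        return list(latest.items())

    events = [(1, "a"), (2, "b"), (1, "c")]  # record 1 is updated twice
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    # One bulk statement instead of three single-row round trips.
    conn.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)",
                     coalesce(events))
    conn.commit()
    print(conn.execute("SELECT * FROM records").fetchall())  # [(1, 'c'), (2, 'b')]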

The first thing I would worry about is your disk layout. You have a mixed workload (OLTP and OLAP), so it is extremely important that your disks are sized and laid out correctly to achieve this throughput; if your I/O subsystem can't handle the load, it doesn't matter which DB you use.

In addition, perhaps those 100,000 inserts a second can be bulk loaded. By the way, 100,000 rows a second amounts to 72,000,000 rows in just 12 minutes (100,000 × 720 seconds), and over 8 billion rows a day, so perhaps you want to store billions of rows?

You probably can't handle 100k individual insert operations per second; you will almost certainly need to batch them into a more manageable number.

A single thread wouldn't be able to issue that many commands anyway, so I would expect 100-1000 threads doing those inserts; see the sketch below.
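
A sketch of both points together, assuming SQLite and hypothetical table/column names: many producer threads enqueue rows, while a single writer thread drains the queue in batches with executemany.

    import queue
    import sqlite3
    import threading
    import time

    BATCH_SIZE = 1000
    STOP = object()  # sentinel telling the writer to flush and exit
    q = queue.Queue()

    def writer():
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (ts REAL, payload TEXT)")
        batch = []
        while True:
            item = q.get()
            if item is STOP:
                break
            batch.append(item)
            if len(batch) >= BATCH_SIZE:
                conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
                conn.commit()
                batch = []
        if batch:  # flush the tail
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
            conn.commit()
        print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (10000,)

    t = threading.Thread(target=writer)
    t.start()

    def producer(n):
        for i in range(n):
            q.put((time.time(), "row %d" % i))

    producers = [threading.Thread(target=producer, args=(2500,)) for _ in range(4)]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    q.put(STOP)
    t.join()

Only the writer thread ever touches the database, so the producers never contend on the connection, and each commit covers up to BATCH_SIZE rows instead of one.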

Depending on your app, you will probably need some kind of high availability as well, unless you're doing something like a scientific app.

My advice is to hire somebody who has a credible answer for you, ideally someone who has done it before. If you don't know, you're not going to be able to develop the app. Hire a senior developer who can answer this question; ask them about it in their interview if you like.
