
Storing a large amount of analytical data

I normally use SQL Server and C# for all of my projects, but I am looking at a project that could potentially grow to billions of rows of data, and I don't feel comfortable doing this in SQL Server.

The data I will be storing is:

  • datetime
  • ipAddress
  • linkId
  • possibly other string related data

I have only ever dealt with relational databases, so I was looking for some guidance on which database technology would be best suited for this type of data storage: one that can scale, and do so at a low cost (compared to sharding SQL Server).

I would then need to pull this data out based on linkId.

Also, would I be able to do the ordering within the query to the DB, or would that be best done in the application?
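
To make the access pattern concrete, here is roughly what I run today against SQL Server (table and column names are illustrative, not a fixed design):

    using System;
    using System.Data.SqlClient;

    static void PrintHitsForLink(string connectionString, int linkId)
    {
        // Fetch every hit for one link, newest first. The ORDER BY is
        // pushed down to the database rather than sorting in C#.
        const string sql = @"
            SELECT [datetime], ipAddress
            FROM   dbo.LinkHits
            WHERE  linkId = @linkId
            ORDER  BY [datetime] DESC;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@linkId", linkId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0:O}  {1}", reader.GetDateTime(0), reader.GetString(1));
                }
            }
        }
    }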

EDIT: It will be cloud-based. Hence I was looking at SQL Azure, which I have used extensively; however, it starts causing issues as the row count goes up.

Since you are looking for general guidance, I feel it is ok to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a Data Warehouse that had 4 nodes, each of which had a main fact table holding 1.2 to 1.5 billion rows (and growing), and it responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It is a web-based application with many users hitting it all day long (though some periods of the day are much harder than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to model that properly as well).

True, the free Express edition might not meet your needs, but Standard Edition likely would, and it is not super expensive. Enterprise Edition has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.

Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seemed to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
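
To make "good data modeling and good index design" slightly more concrete for the table you describe, here is one plausible shape. This is a sketch built on my own assumptions (the table name, column types, and sizes are all illustrative), not a prescription: clustering on (linkId, datetime) keeps each link's rows physically together and pre-sorted by time, so pulling rows by LinkID in time order stays a single range scan even at billions of rows.

    using System.Data.SqlClient;

    static void CreateLinkHitsTable(string connectionString)
    {
        // Hypothetical DDL. Clustering on (linkId, [datetime]) means
        // "WHERE linkId = @x ORDER BY [datetime]" needs no extra sort.
        const string ddl = @"
            CREATE TABLE dbo.LinkHits
            (
                linkId     INT           NOT NULL,
                [datetime] DATETIME2(3)  NOT NULL,
                ipAddress  VARCHAR(45)   NOT NULL,  -- long enough for IPv6 text
                otherData  NVARCHAR(400) NULL       -- the possible other string data
            );

            CREATE CLUSTERED INDEX [CIX_LinkHits]
                ON dbo.LinkHits (linkId, [datetime]);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ddl, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

"Regular index maintenance" then mostly means reorganizing or rebuilding that clustered index as it fragments.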

Good luck :)

EDIT:

  • When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.
  • I intentionally left out the Table Partitioning feature not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/Index partitioning in SQL Server is not a means of "sharding".
  • I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating (a minimal sketch of creating one follows this list). They were introduced in SQL Server 2012 and came with the restriction that the table cannot be updated once the Column Store index has been created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
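
For reference, here is a minimal sketch of creating one, reusing the illustrative dbo.LinkHits table from above (and assuming SQL Server 2012 Enterprise Edition):

    using System.Data.SqlClient;

    static void AddColumnStoreIndex(string connectionString)
    {
        // Hypothetical: a nonclustered Column Store index over the
        // analytical columns. On SQL Server 2012 this makes the table
        // read-only until the index is dropped (or new data is switched
        // in via Table Partitioning), the restriction noted above.
        const string ddl = @"
            CREATE NONCLUSTERED COLUMNSTORE INDEX [NCCI_LinkHits]
                ON dbo.LinkHits (linkId, [datetime], ipAddress);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ddl, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }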

Given that this needs to be cloud-based and that you use .Net / C#, if you really are only talking about a few tables (so far, just the stated one and the implied "Link" table, the source of LinkID) and hence might not need relationships or some of the other RDBMS features, then one option is Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development, and even the initial stage of rolling out a project, are made a bit easier by their free low-end tier. As of 2013-11-04, the main DynamoDB page states that:

AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.

Here is some documentation: Overview, How to Query with .Net, and the general .Net SDK.
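
To show what the query-by-LinkID pattern could look like, here is a rough sketch; the table name, key schema, and attribute names are my assumptions, and the exact client API varies by SDK version, so check the links above:

    using System;
    using System.Collections.Generic;
    using Amazon.DynamoDBv2;
    using Amazon.DynamoDBv2.Model;

    static void PrintHitsForLink(string linkId)
    {
        // Assumes a table "LinkHits" whose hash key is linkId and whose
        // range key is the timestamp, so one link's rows can be read
        // with a Query (cheap) instead of a Scan (expensive).
        var client = new AmazonDynamoDBClient();  // region/credentials from config

        var request = new QueryRequest
        {
            TableName = "LinkHits",
            KeyConditionExpression = "linkId = :link",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":link", new AttributeValue { S = linkId } }
            },
            // Items come back sorted by the range key; false = newest
            // first, so here, too, the ordering is done by the database.
            ScanIndexForward = false
        };

        QueryResponse response = client.Query(request);
        foreach (var item in response.Items)
        {
            Console.WriteLine("{0}  {1}", item["timestamp"].S, item["ipAddress"].S);
        }
    }

One caveat relevant to your ordering question: DynamoDB only returns items sorted by the range key of the table (or of an index); there is no general ORDER BY, so any other ordering would have to be done in the application.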

BE AWARE: When looking into how much you think it might cost, be sure to include related AWS charges, such as network usage.
