
Cassandra or PostgreSQL: High volume of Inserts per minute

Here is my scenario:

  1. I have 100,000+ tables.
  2. I have to insert into each table every minute, i.e. 100,000+ inserts per minute, all into separate tables.
  3. Data loss doesn't matter much, but speed and cost do.
  4. The insertion fields would be id, param1, param2, param3, param4, param5, timestamp.

Please let me know which database would be faster and cheaper for this case.

Cassandra may face serious scalability issues with 100,000 separate tables. That many tables means some multiple of 100,000 open files (so you'll need to make sure your kernel is configured to allow that many open file descriptors) and 100,000 memtables (where the latest modifications to each table are temporarily kept in memory), so you'll also need a lot of memory.

An alternative way to do something like this in Cassandra is to have one table with 100,000 different partitions (which is the Cassandra name for wide rows). Each minute you'd add one further row (a small entry) to each of the existing partitions. To keep partitions from growing huge after, say, months of adding entries, the usual approach is to start a new partition roughly every week (a week has about 10,000 minutes). In Cassandra modelling this is often called "time series data".
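For illustration, here is a minimal sketch of what such a schema could look like with the DataStax Python driver. The keyspace name, the numeric column types, and the idea of turning each original table's identity into a table_id column are assumptions, not something stated in the question:

```python
# Sketch only: one Cassandra table, one partition per (logical table, week),
# with one row appended per minute. Assumes a local node and an existing
# keyspace named "metrics_ks" (both hypothetical).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics_ks")

session.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        table_id int,         -- which of the 100,000 logical tables
        week     date,        -- weekly bucket keeps partitions bounded
        ts       timestamp,   -- the per-minute timestamp
        param1 double, param2 double, param3 double,
        param4 double, param5 double,
        PRIMARY KEY ((table_id, week), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
```

Each (table_id, week) pair is one partition, so a partition never collects more than roughly 10,000 rows before the week rolls over.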

In your question, you only mentioned writing data, and not reading it. Assuming this is not an oversight and you really care more about write performance than read performance, then Cassandra is a good fit because it is especially fast for writes. If you absolutely care about speed and performance-per-dollar, you should also take a look at Scylla, a re-implementation of Cassandra in C++.
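If write throughput is the deciding factor, the per-minute load might be issued with a prepared statement and asynchronous execution, roughly as sketched below. This builds on the hypothetical metrics table above; write_minute and its input format are made up for illustration:

```python
# Sketch only: push one minute's worth of rows with async writes.
# In practice you would throttle the number of in-flight requests
# rather than launching 100,000 futures at once.
import datetime

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics_ks")

insert = session.prepare("""
    INSERT INTO metrics (table_id, week, ts,
                         param1, param2, param3, param4, param5)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""")

def week_bucket(ts: datetime.datetime) -> datetime.date:
    # Truncate the timestamp to the Monday of its week.
    return (ts - datetime.timedelta(days=ts.weekday())).date()

def write_minute(rows):
    """rows: iterable of (table_id, p1, p2, p3, p4, p5) tuples."""
    now = datetime.datetime.now(datetime.timezone.utc)
    week = week_bucket(now)
    futures = [
        session.execute_async(insert, (tid, week, now, p1, p2, p3, p4, p5))
        for tid, p1, p2, p3, p4, p5 in rows
    ]
    # The question says some data loss is acceptable, so failures are
    # only counted here, not retried.
    failed = 0
    for f in futures:
        try:
            f.result()
        except Exception:
            failed += 1
    return failed
```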

It sounds like your data fits a time-series model, and TimescaleDB may handle it, possibly with its newer distributed (multi-node) setup. The 100,000 tables would become just one more indexed field: keep the data in time order in a single table and enable compression. You may also want to consider index types other than B-trees rather than restricting yourself to them.
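A rough sketch of that layout with TimescaleDB and psycopg2 follows. The database name, chunk interval, and compression-policy age are arbitrary assumptions; the columns mirror the fields listed in the question:

```python
# Sketch only: a single hypertable instead of 100,000 tables; the former
# table identity becomes the indexed table_id column. Assumes the
# TimescaleDB extension is already installed in the database.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        table_id integer     NOT NULL,   -- which logical table
        ts       timestamptz NOT NULL,
        param1 double precision, param2 double precision,
        param3 double precision, param4 double precision,
        param5 double precision
    )
""")

# Partition by time; the one-week chunk size is a guess, not a recommendation.
cur.execute("""
    SELECT create_hypertable('metrics', 'ts',
                             chunk_time_interval => INTERVAL '1 week',
                             if_not_exists => TRUE)
""")

# Keep per-table lookups in time order without one B-tree per logical table.
cur.execute("""
    CREATE INDEX IF NOT EXISTS metrics_table_ts
        ON metrics (table_id, ts DESC)
""")

# Enable native compression, segmented by the logical table id, and
# compress chunks once they are a week old.
cur.execute("""
    ALTER TABLE metrics SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'table_id'
    )
""")
cur.execute("SELECT add_compression_policy('metrics', INTERVAL '7 days')")
```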

Our tests on finance data showed impressive compression ratios, especially when all the tables hold similar data for nearby time periods, e.g. cumulative and scaled values across 3-4k instruments. We didn't try 100k tables, but you may want to run some benchmarks, see where the limit is, and in case of steep degradation shard out to different machines/clusters.

Maintenance may be a bit problematic if you decide to manage multiple manually sharded servers, but a single box can work cost magic compared to a modern cluster. Multiple powerful but isolated boxes can be used if data loss can be tolerated, e.g. if the data can be replayed from a different source in reasonable time (like an efficient market-data replay from archives).
