
What are the key performance advantages of in-memory databases vs disk based NoSQL databases?

Reading Designing Data Intensive Applications book, I encountered this statement:

Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don't need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk. OLTP Through the Looking Glass, and What We Found There

So my questions are:

  1. Given that a disk-based NoSQL database (MongoDB) is given the same amount of RAM as an in-memory database (Redis), will they perform about the same?
  2. What if a disk-based database used a write-back caching strategy with asynchronous writes to persistent storage, with both databases having the same amount of RAM? Would that make its performance similar to an in-memory database?
  3. Even if serialization has such a high penalty (as mentioned in the quote above), would a disk-based NoSQL database match the performance of in-memory databases for a read-heavy system, with the same amount of cache given to both databases?

I'm very new to the NoSQL world, so please point me in the right direction if I've missed something.
PS: I've read Difference between In memory databases and disk memory database, but it doesn't address my specific questions.

I quickly tested your question (1). This is probably too naive, but it should give a first answer, and it's meant more as encouragement to test for yourself.

Inserting 100,000 key/value pairs:

Redis

Redis setting
time: 4.391808
Redis getting: second run
time: 4.129066

MongoDB

Mongo setting
time: 30.313092
Mongo getting: second run
time: 33.969624

BUT: Redis and MongoDB are very different systems, and it isn't clear that comparing the two is meaningful. Don't optimise for performance unless you are actually experiencing performance problems.

Notes

MongoDB was started with mongod --storageEngine wiredTiger --syncdelay 0 --journalCommitInterval 500 --dbpath /usr/local/var/mongodb and the machine has 32GB RAM (=plenty to keep all data in memory).

Here's the Ruby script I used:

require 'redis'
require 'mongo'

# Wall-clock timer. Defined before first use; the original script defined
# it at the bottom, which raises NoMethodError when run top to bottom.
def time
    start = Time.now
    yield
    puts "time: #{Time.now - start}"
end

redis = Redis.current
redis.flushall
mongodb = Mongo::Client.new([ '127.0.0.1:27017' ], :database => 'test')
collection = mongodb[:mycollection]
collection.delete_many({})
collection.indexes.create_one(name: 1)

setids = (0..100000).to_a.map {|i| {name: "#plop_#{i}", val: i} }.shuffle
getids = (0..100000).to_a.map {|i| {name: "#plop_#{i}"} }.shuffle

puts "Redis setting"

time do
    x = 0
    setids.each do |i|
        x += i[:val]
        redis.set(i[:name], i[:val])
    end
    fail unless x == (0..100000).sum
end

["first", "second"].each do |run|
    puts "Redis getting: #{run} run"

    time do
        x = 0
        getids.each do |i|
            x += redis.get(i[:name]).to_i
        end
        fail unless x == (0..100000).sum
    end
end

puts "Redis setting (hashes)"
redis.flushall
time do
    x = 0
    setids.each do |i|
        x += i[:val]
        redis.hset(i[:name], :val, i[:val])
    end
    fail unless x == (0..100000).sum
end

["first", "second"].each do |run|
    puts "Redis getting (hashes): #{run} run"
    time do
        x = 0
        getids.each do |i|
            x += redis.hget(i[:name], :val).to_i
        end
        fail unless x == (0..100000).sum
    end
end

puts "Mongo setting"
time do
    x = 0
    setids.each do |i|
        x += i[:val]
        collection.insert_one(i)
    end
    fail unless x == (0..100000).sum
end

["first", "second"].each do |run|
    puts "Mongo getting: #{run} run"

    time do
        x = 0
        getids.each do |i|
            x += collection.find(i).first[:val]
        end
        fail unless x == (0..100000).sum
    end
end

Full disclosure: I represent the vendor of eXtremeDB, one of the first in-memory database systems (first released 2001).

  1. Given that a disk-based NoSQL database (MongoDB) is given the same amount of RAM as an in-memory database (Redis), will they perform about the same?

No. They are very different DBMSs, as evidenced by Matt's answer.

  2. What if a disk-based database used a write-back caching strategy with asynchronous writes to persistent storage, with both databases having the same amount of RAM? Would that make its performance similar to an in-memory database?

Again, no. Asynchronous or not, it's still system activity that steals CPU cycles from what would otherwise be a CPU-bound system. (I'm assuming here that the rationale for an in-memory database is performance, so database activity is intense.) In addition, disk-based and in-memory DBMSs handle transaction atomicity differently. It can be simpler for a pure in-memory DBMS, and optimized for the normal case of committing a transaction. In the best case, we can update the data in place and copy the before-image to a rollback buffer. If the transaction commits, we just discard the rollback buffer. Thus commits are very fast, but aborts take more time. Things get more complex when you need to enforce READ COMMITTED in a concurrent-access setting (MVCC).
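The commit-optimized scheme described above (update in place, keep a before-image rollback buffer) can be sketched in a few lines of Ruby. This is a toy, single-threaded illustration; the class and method names are made up, not from any real IMDS:

```ruby
# Toy in-memory store: a transaction updates data in place and saves the
# before-image of each touched key; commit discards those images, abort
# copies them back. (Illustrative only -- no concurrency control.)
class TinyImdb
  def initialize
    @data = {}
    @rollback = nil   # key => before-image for the open transaction
  end

  def begin_txn
    @rollback = {}
  end

  def put(key, value)
    # Save the before-image only once per key per transaction.
    @rollback[key] = @data[key] if @rollback && !@rollback.key?(key)
    @data[key] = value          # update in place
  end

  def get(key)
    @data[key]
  end

  def commit
    @rollback = nil             # fast path: just drop the before-images
  end

  def abort
    # Slow path: restore every before-image (delete keys that didn't exist).
    @rollback.each { |k, v| v.nil? ? @data.delete(k) : @data[k] = v }
    @rollback = nil
  end
end

db = TinyImdb.new
db.begin_txn
db.put(:a, 1)
db.commit
db.begin_txn
db.put(:a, 99)
db.abort
puts db.get(:a)   # => 1, the abort restored the before-image
```

Note how commit is O(1) regardless of transaction size, while abort is proportional to the number of keys touched, matching the "commits are very fast, but aborts take more time" trade-off.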

  3. Even if serialization has such a high penalty (as mentioned in the quote above), would a disk-based NoSQL database match the performance of in-memory databases for a read-heavy system, with the same amount of cache given to both databases?

No. A disk-based DBMS does not (and cannot) know that its data is fully cached. It will always go through the logic of determining whether the requested page is in cache, and that isn't free (it costs CPU cycles). A true in-memory DBMS has no such lookup logic and eliminates that processing. Furthermore, disk-based DBMSs use large page sizes (usually a multiple of the disk blocking factor, so 4K, 8K, 16K, etc.) that can hold many records/rows/objects/documents. After determining whether a page is in cache, it's still necessary to find the specific object on the page. Of course, this doesn't apply to every DBMS; implementation details vary widely. Regardless, an in-memory database doesn't care about the disk blocking factor and doesn't want to waste cycles finding an object on a page. We use a small page size that eliminates, or drastically reduces, the in-page search for an object.
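The two-step access path can be sketched like this: even with every page resident in RAM, a disk-style engine first locates the page in its buffer pool and then searches the page for the record, while an in-memory store needs only a single direct lookup. The classes and the 64-page pool below are illustrative, not modeled on any particular engine:

```ruby
# Disk-style access: even a fully cached read pays for two steps,
# locating the page, then finding the record inside it.
# (Toy illustration; real buffer pools are far more involved.)
class PagedStore
  PAGES = 64

  def initialize
    @buffer_pool = Array.new(PAGES) { {} }  # every "page" already cached
  end

  def put(key, value)
    @buffer_pool[key.hash % PAGES][key] = value
  end

  def get(key)
    page = @buffer_pool[key.hash % PAGES]   # step 1: find the page
    page[key]                               # step 2: find the record on it
  end
end

# In-memory access: one direct lookup, no page indirection at all.
class FlatStore
  def initialize
    @data = {}
  end

  def put(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end
end
```

Both return the same answers; the difference is purely the extra level of indirection on every single read, which is exactly where the "wasted" CPU cycles go.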

Also, the ways that disk-based and in-memory DBMSs implement indexes are (or should be) very different. Without going into detail (see the white papers referenced below), the end result is that b-trees are deeper for a disk-based database than for an in-memory database with an equal number of rows. Alternatively, an in-memory database may use a different type of index altogether (a t-tree or a hash). But let's stick with b-trees. The effect of a deeper tree is that the average and worst-case number of levels needed to walk the tree to find the search value is higher. Lastly, once the b-tree node (which is equal to a database page) is found, a binary search is used to find the search value (the slot) on the page. That binary search on a page that is 4K (or 16K, or...) takes many more iterations than on a page that's a few hundred bytes. Again, it all boils down to more CPU cycles.
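The in-page cost is easy to quantify: a binary search over a page's slot array takes about log2(slots) comparisons per lookup. A quick back-of-the-envelope in Ruby, using illustrative slot counts (the real number of slots per page depends on record size and page layout):

```ruby
# Comparisons a binary search needs to locate one slot on a page.
def binary_search_steps(slots_per_page)
  Math.log2(slots_per_page).ceil
end

# Illustrative slot counts: large disk pages packed with small records
# vs. a page of a few hundred bytes.
{
  "4K page, 256 slots"       => 256,
  "16K page, 1024 slots"     => 1024,
  "256-byte page, 16 slots"  => 16
}.each do |label, slots|
  puts "#{label}: ~#{binary_search_steps(slots)} comparisons"
end
# => ~8, ~10, and ~4 comparisons respectively
```

So each in-page search on a 4K page costs roughly twice the comparisons of one on a few-hundred-byte page, and that cost is paid on every index node visited.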

There are other considerations. Please feel free to read our white papers (free access, no registration required): "Will The Real IMDS Please Stand Up?" and "In-Memory Database Systems: Myths and Facts".
