MongoDB poor write performance on large collections with 50.000.000 documents plus
I have got a MongoDB which stores product data for 204.639.403 items; that data has already been split up, by the item's country, into four logical databases running on the same physical machine in the same MongoDB process.
Here is a list with the number of documents per logical database:
My problem is that database write performance keeps getting worse; writes to the largest of the four databases (De) in particular have become really bad. According to iotop, the mongod process uses 99% of the IO time while writing less than 3 MB and reading less than 1.5 MB per second. This leads to long database locks; 100%+ locks have become normal according to mongostat, even when all processes writing to and reading from the other country databases have been stopped. The current slave reaches a load of up to 6 while the replica set master has a load of 2-3 at the same time, so it causes replication lag, too.
Each database has the same data and index structure; I am using only the largest database (De) for the further examples.
This is a random item taken from the database, just as an example; the structure is optimized so that all important data can be gathered with a single read:
{
    "_id" : ObjectId("533b675dba0e381ecf4daa86"),
    "ProductId" : "XGW1-E002F-DW",
    "Title" : "Sample item",
    "OfferNew" : {
        "Count" : 7,
        "LowestPrice" : 2631,
        "OfferCondition" : "NEW"
    },
    "Country" : "de",
    "ImageUrl" : "http://….jpg",
    "OfferHistoryNew" : [
        …
        {
            "Date" : ISODate("2014-06-01T23:22:10.940+02:00"),
            "Value" : {
                "Count" : 10,
                "LowestPrice" : 2171,
                "OfferCondition" : "NEW"
            }
        }
    ],
    "Processed" : ISODate("2014-06-09T23:22:10.940+02:00"),
    "Eans" : [
        "9781241461959"
    ],
    "OfferUsed" : {
        "Count" : 1,
        "LowestPrice" : 5660,
        "OfferCondition" : "USED"
    },
    "Categories" : [
        NumberLong(186606),
        NumberLong(541686),
        NumberLong(288100),
        NumberLong(143),
        NumberLong(15777241)
    ]
}
Typical queries range from simple ones, like a lookup by the ProductId or an EAN only, to refinements by category sorted by rank A, or refinements by category plus a rank A range (1 up to 10.000, for example) sorted by rank B… .
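Sketched in mongo-shell syntax, those query shapes might look roughly like this (field and index names are taken from the stats below; the concrete filter values are made up for illustration):

```javascript
// Simple lookups by ProductId or EAN (both indexed):
db.Item.find({ ProductId: "XGW1-E002F-DW" });
db.Item.find({ Eans: "9781241461959" });

// Refinement by category, sorted by rank A
// (served by the Categories_1_RankA_-1 index):
db.Item.find({ Categories: NumberLong(186606) }).sort({ RankA: -1 });

// Refinement by category and a rank A range, sorted by rank B
// (served by the Categories_1_RankA_1_RankB_-1 index):
db.Item.find({
    Categories: NumberLong(186606),
    RankA: { $gte: 1, $lte: 10000 }
}).sort({ RankB: -1 });
```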
These are the stats from the largest DB:
{
    "ns" : "De.Item",
    "count" : 61216165,
    "size" : 43915150656,
    "avgObjSize" : 717,
    "storageSize" : 45795192544,
    "numExtents" : 42,
    "nindexes" : 6,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 41356824320,
    "indexSizes" : {
        "_id_" : 2544027808,
        "RankA_1" : 1718096464,
        "Categories_1_RankA_1_RankB_-1" : 16383534832,
        "Eans_1" : 2846073776,
        "Categories_1_RankA_-1" : 15115290064,
        "ProductId_1" : 2749801376
    },
    "ok" : 1
}
It is worth mentioning that the index size is nearly half the storage size.
Each country DB has to handle 3-5 million updates/inserts per day; my target is to perform the write operations in less than five hours during the night.
Currently it's a replica set with two servers; each has 32GB RAM and a RAID1 with 2TB HDDs. Simple optimizations like the deadline I/O scheduler and noatime have already been applied.
I have worked out some optimization strategies:
But there should be other optimization strategies, too, that did not come to my mind and that I would like to hear about!
Which optimization strategy sounds most promising, or is a mixture of several optimizations needed here?
Most likely you are running into issues due to record growth, see http://docs.mongodb.org/manual/core/write-performance/#document-growth .
Mongo prefers records of fixed (or at least bounded) size. Increasing the record size beyond the pre-allocated storage will cause the document to be moved to another location on disk, multiplying your I/O with each write. Consider pre-allocating "enough" space for your average document on insert, if your document sizes are relatively homogeneous.
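A minimal sketch of that pre-allocation trick in mongo-shell syntax (the `_padding` field name is made up): insert the document with a filler field sized for the expected growth, then `$unset` it, so the record keeps its on-disk allocation as padding:

```javascript
// Insert with a throw-away filler field so the record is allocated
// at roughly its expected final size (field name is hypothetical):
var padding = new Array(2048).join("x");   // ~2 KB of filler
db.Item.insert({
    ProductId: "XGW1-E002F-DW",
    OfferHistoryNew: [],
    _padding: padding
});

// Remove the filler; the on-disk record keeps its size, leaving
// room for the OfferHistoryNew array to grow in place:
db.Item.update({ ProductId: "XGW1-E002F-DW" },
               { $unset: { _padding: "" } });
```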
如果您的文档大小相对同质,请考虑为插入时的普通文档预分配“足够”空间。 Otherwise consider splitting rapidly growing nested arrays into a separate collection, thereby replacing updates with inserts.
否则,请考虑将快速增长的嵌套数组拆分为单独的集合,从而用插入替换更新。 Also check your fragmentation and consider compacting your databases from time to time, so that you have a higher density of documents per block which will cut down on hard page faults.
还要检查您的碎片并考虑不时地压缩您的数据库,以便每个块具有更高密度的文档,这将减少硬页面错误。
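Sketched in mongo-shell syntax, those two suggestions could look like this (the `OfferHistory` collection name is made up). Instead of `$push`-ing entries into the embedded `OfferHistoryNew` array, which grows the parent document, each history entry becomes a fixed-size insert into its own collection; `compact` can then be run per collection to defragment:

```javascript
// Instead of growing the item document in place with
//   db.Item.update({ _id: id }, { $push: { OfferHistoryNew: entry } })
// write each history entry as a new, fixed-size document
// (the OfferHistory collection name is hypothetical):
db.OfferHistory.insert({
    ItemId: ObjectId("533b675dba0e381ecf4daa86"),
    Date:   ISODate("2014-06-01T23:22:10.940+02:00"),
    Value:  { Count: 10, LowestPrice: 2171, OfferCondition: "NEW" }
});
db.OfferHistory.ensureIndex({ ItemId: 1, Date: -1 });

// Defragment a collection from time to time (compact blocks the
// database while it runs, so schedule it in a maintenance window):
db.runCommand({ compact: "Item" });
```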
Would you consider using a database with better throughput that supports documents? I've heard success stories with TokuMX. And FoundationDB (where I'm an engineer) has very good performance with highly concurrent write loads and large documents. Happy to answer further questions about FoundationDB.