简体   繁体   中英

MongoDB ingestion very slow

Problem

We are attempting to ingest 6 billion elements into a sharded cluster and ingestion starts very fast but slows down to almost zero progress after a large number of elements are inserted (eg 5 / 6 billion). We have tried disabling the balancer and not indexing until after ingestion but neither have helped.

Attributes of the data and ingestion pattern:

  1. Chunk size: 64MB (default)
  2. Batch size (ie size of bulk requests): 1000
  3. 12 threads per node doing concurrent bulk requests of size 1000 documents
  4. Average document size: 568.7417637713185
  5. Number of chunks: 66586

Points of interest in the mongo logs include: Consistent splitVector() & splitChunk() calls with a gradually increasing number of splits per call (eg 2 at the beginning and over 3k with 5bn+ elements ingested).

Excerpt from start of ingestion when throughput is very good - notice the low split count

2016-12-12T18:21:15.652+0000 W SHARDING [conn139] Finding the split vector for db.d1 over { _id: "hashed" } keyCount: 59074 numSplits: 2 lookedAt
: 11332 took 146ms
2016-12-12T18:21:15.652+0000 I COMMAND  [conn139] command admin.$cmd command: splitVector { splitVector: "db.d1", keyPattern: { _id: "hashed" }, 
min: { _id: 8891696537615988474 }, max: { _id: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } keyUpdates:0 w
riteConflicts:0 numYields:1011 reslen:208 locks:{ Global: { acquireCount: { r: 2024 } }, Database: { acquireCount: { r: 1012 } }, Collection: { a
cquireCount: { r: 1012 } } } protocol:op_command 146ms
2016-12-12T18:21:15.652+0000 I SHARDING [conn131] received splitChunk request: { splitChunk: "db.d1", keyPattern: { _id: "hashed" }, min: { _id: 
8891696537615988474 }, max: { _id: MaxKey }, from: "shardrepl6", splitKeys: [ { _id: 9043183284609251672 }, { _id: 9194457032126213414 } ], confi
gdb: "configrep/nohdp-m:27019,nohdp-w-0:27019,nohdp-w-1:27019", shardVersion: [ Timestamp 6000|121, ObjectId('584ee95d47c9d485eeba7e9e') ], epoch
: ObjectId('584ee95d47c9d485eeba7e9e') }

Excerpt from mongo log when 5bn+ elements are ingested - notice that the inserts are taking over 10 minutes now for 1000 documents!:

2016-12-14T10:15:09.266+0000 W SHARDING [conn7948] Finding the split vector for db.d1 over { _id: "hashed" } keyCount: 59074 numSplits: 3295 look
edAt: 49399 took 147703ms
2016-12-14T10:15:09.270+0000 I COMMAND  [conn7948] command admin.$cmd command: splitVector { splitVector: "db.d1", keyPattern: { _id: "hashed" },
 min: { _id: 8286917826972549101 }, max: { _id: 8891696537615988474 }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } 
keyUpdates:0 writeConflicts:0 numYields:1521105 reslen:78136 locks:{ Global: { acquireCount: { r: 3042212 } }, Database: { acquireCount: { r: 152
1106 } }, Collection: { acquireCount: { r: 1521106 }, acquireWaitCount: { r: 4 }, timeAcquiringMicros: { r: 273 } } } protocol:op_command 147707m
s
2016-12-14T10:15:09.272+0000 I NETWORK  [initandlisten] connection accepted from 10.132.0.27:57186 #7950 (98 connections now open)
2016-12-14T10:15:09.275+0000 I SHARDING [conn7950] warning: log line attempted (96kB) over max size (10kB), printing beginning and end ... receiv
ed splitChunk request: { splitChunk: "db.d1", keyPattern: { _id: "hashed" }, min: { _id: 8286917826972549101 }, max: { _id: 8891696537615988474 }
, from: "shardrepl6", splitKeys: [ { _id: 8287101346889983059 }, { _id: 8287284735574907001 }, { _id: 8287468462323984570 }, { _id: 8287652170283
995411 }, { _id: 8287835497934387608 }, { _id: 8288018804450410355 }, { _id: 8288203522286937586 }, { _id: 8288388071505970055 }, { _id: 82885716
40651055014 }, { _id: 8288755180953265490 }, { _id: 8288938525600999756 }, { _id: 8289121602315847626 }, { _id: 8289303486256147041 }, { _id: 828
9487583455291418 }, { _id: 8289670676306614691 }, { _id: 8289854621499682550 }, { _id: 8290039202265877300 }, { _id: 8290222816842142600 }, { _id
: 8290407380208535449 }, { _id: 8290591646991349090 }, { _id: 8290775216085028368 }, { _id: 8290958397483060614 }, { _id: 8291141296044448774 }, 
{ _id: 8291325782881743950 }, { _id: 8291509640956353313 }, { _id: 8291692615717188224 }, { _id: 8291875865635048179 }, { _id: 829205937649184666
7 }, { _id: 8292243673713657589 }, { _id: 8292427637936143831 }, { _id: 8292611257535749373 }, { _id: 8292795483981231158 }, { _id: 8292978669576
490373 }, { _id: 8293161968622707955 }, { _id: 8293345020701432227 }, { _id: 8293527882782566480 }, { _id: 8293711119262214324 }, { _id: 82938944
23409576856 }, { _id: 8294078020795137503 }, { _id: 8294261767542526273 }, { _id: 8294443892466553095 }, { _id: 8294627093936215270 }, { _id: 829
4809793561574760 }, { _id: 8294993822250254718 }, { _id: 8295177652382819277 }, { _id: 8295360493433435532 }, { _id: 8295545257671541454 }, { _id
: 8295728932259232603 }, { _id: 8295913123304033493 }, { _id: 8296096457342755886 }, { _id: 8296280022998881301 }, { _id: 8296464671386836894 }, 
{ _id: 8296647186544445907 }, { _id: 8296830323943653126 }, { _id: 8297013730098864353 }, { _id: 8297196185619634257 }, { _id: 829737940146222018
0 }, { _id: 8297564988642525703 }, { _id: 8297747491525850110 }, { _id: 8297931581379542681 }, { _id: 8298114993102153399 }, { _id: 8298297439986
149518 }, { _id: 8298480444951235977 }, { _id: 8298663990688993348 }, { _id: 8298847055994867152 }, { _id: 8299031052877402611 }, { _id: 82992151
24614070361 }, { _id: 8299398215999968277 }, { _id: 8299582388940175975 }, { _id: 8299765821373923699 }, { _id: 8299950810133585214 }, { _id: 830
0133645889704871 }, { _id: 8300315774024380628 }, { _id: 8300499423759882486 }, { _id: 8300682010031403080 }, { _id: 8300865738536319852 }, { _id
: 8301049781229486084 }, { _id: 8301232456988002710 }, { _id: 8301415794714171555 }, { _id: 8301599451945947359 }, { _id: 8301783564425118532 }, 
{ _id: 8301968056396672988 }, { _id: 8302151637208684223 }, { _id: 8302333658023334194 }, { _id: 8302517935322753985 }, { _id: 830270052609467309
8 }, { _id: 8302883964901452771 }, { _id: 8303066862469161182 }, { _id: 8303249789429502361 }, { _id: 8303433114392011605 }, { _id: 8303616070263
699043 }, { _id: 8303799771311359310 }, { _id: 8303983170608151334 }, { _id: 8304166553694211715 }, { _id: 8304349400014341014 },
...
d: 8890990516605405909 }, { _id: 8891174917314061663 }, { _id: 8891360375988727310 }, { _id: 8891543753314122854 } ], configdb: "configrep/nohdp-m:27019,nohdp-w-0:27019,nohdp-w-1:27019", shardVersion: [ Timestamp 6000|49, ObjectId('584ee95d47c9d485eeba7e9e') ], epoch: ObjectId('584ee95d47c9d485eeba7e9e') }

2016-12-14T10:15:09.278+0000 I SHARDING [conn7950] distributed lock 'db.d1' acquired for 'splitting chunk [{ _id: 8286917826972549101 }, { _id: 8891696537615988474 }) in db.d1', ts : 58511bad1fce2179d2bc31de
2016-12-14T10:15:09.279+0000 I SHARDING [conn7950] remotely refreshing metadata for db.d1 based on current shard version 6|96080||584ee95d47c9d485eeba7e9e, current metadata version is 6|96089||584ee95d47c9d485eeba7e9e
2016-12-14T10:15:09.290+0000 I SHARDING [conn7950] metadata of collection db.d1 already up to date (shard version : 6|96080||584ee95d47c9d485eeba7e9e, took 11 ms)
2016-12-14T10:15:09.294+0000 W SHARDING [conn7950] splitChunk cannot find chunk [{ _id: 8286917826972549101 },{ _id: 8891696537615988474 }) to split, the chunk boundaries may be stale
2016-12-14T10:15:09.297+0000 I SHARDING [conn7950] distributed lock with ts: 58511bad1fce2179d2bc31de' unlocked.
2016-12-14T10:15:09.298+0000 I NETWORK  [conn7950] end connection 10.132.0.27:57186 (97 connections now open)
2016-12-14T10:16:04.652+0000 I NETWORK  [conn7947] end connection 10.132.0.27:57152 (96 connections now open)
2016-12-14T10:19:31.075+0000 I COMMAND  [conn98] command db.d1 command: insert { insert: "d1", documents: 186, ordered: false, shardVersion: [ Timestamp 6000|49, ObjectId('584ee95d47c9d485eeba7e9e') ] } ninserted:186 keyUpdates:0 writeConflicts:0 numYields:0 reslen:217 locks:{ Global: { acquireCount: { r: 189, w: 189 } }, Database: { acquireCount: { w: 189 } }, Collection: { acquireCount: { w: 3 } }, Metadata: { acquireCount: { w: 186 } }, oplog: { acquireCount: { w: 186 } } } protocol:op_command 136ms
2016-12-14T10:20:09.270+0000 I NETWORK  [conn7948] end connection 10.132.0.27:57158 (95 connections now open)
2016-12-14T10:26:19.571+0000 I COMMAND  [conn91] command db.d1 command: insert { insert: "d1", documents: 170, ordered: false, shardVersion: [ Timestamp 6000|49, ObjectId('584ee95d47c9d485eeba7e9e') ] } ninserted:170 keyUpdates:0 writeConflicts:0 numYields:0 reslen:217 locks:{ Global: { acquireCount: { r: 173, w: 173 } }, Database: { acquireCount: { w: 173 } }, Collection: { acquireCount: { w: 3 } }, Metadata: { acquireCount: { w: 170 } }, oplog: { acquireCount: { w: 170 } } } protocol:op_command 122ms

MongoDB Configuration:

| Name       | Role 1        | Role 2  |
|------------|---------------|---------|
| nohdp-m    | Config Server |         |
| nohdp-w-0  | Config Server |         |
| nohdp-w-1  | Config Server |         |
| nohdp-w-2  | Shard 1       | Router1 |
| nohdp-w-3  | Shard 2       | Router2 |
| nohdp-w-4  | Shard 3       | Router3 |
| nohdp-w-5  | Shard 4       | Router4 |
| nohdp-w-6  | Shard 5       | Router5 |
| nohdp-w-7  | Shard 6       | Router6 |
| nohdp-w-8  | Replica 1     |         |
| nohdp-w-9  | Replica 2     |         |
| nohdp-w-10 | Replica 3     |         |
| nohdp-w-11 | Replica 4     |         |
| nohdp-w-12 | Replica 5     |         |
| nohdp-w-13 | Replica 6     |         |

Excerpt from the stats() command at the end of an ingestion run:

mongos> db.d1.stats()
{
    "sharded" : true,
    "capped" : false,
    "ns" : "db.d1",
    "count" : 5939228000,
    "size" : NumberLong("3377887008160"),
    "storageSize" : NumberLong("1496882728960"),
    "totalIndexSize" : 223020806144,
    "indexSizes" : {
        "_id_" : 60463452160,
        "_id_hashed" : 162557353984
    },
    "avgObjSize" : 568.7417637713185,
    "nindexes" : 2,
    "nchunks" : 66586,
    "shards" : {
        "shardrepl1" : {
            "ns" : "db.d1",
            "count" : 989892578,
            "size" : 562994179550,
            "avgObjSize" : 568,
            "storageSize" : 249483833344,
            "capped" : false,
            "wiredTiger" : {
                "metadata" : {
                    "formatVersion" : 1
                },
                "creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_max=15,merge_min=0),memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,source=,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,type=file,value_format=u",
                "type" : "file",
                "uri" : "statistics:table:collection-11-4419223010582756584",
                "LSM" : {
                    "bloom filter false positives" : 0,
                    "bloom filter hits" : 0,
                    "bloom filter misses" : 0,
                    "bloom filter pages evicted from cache" : 0,
                    "bloom filter pages read into cache" : 0,
                    "bloom filters in the LSM tree" : 0,
                    "chunks in the LSM tree" : 0,
                    "highest merge generation in the LSM tree" : 0,
                    "queries that could have benefited from a Bloom filter that did not exist" : 0,
                    "sleep for LSM checkpoint throttle" : 0,
                    "sleep for LSM merge throttle" : 0,
                    "total size of bloom filters" : 0
                },
                "block-manager" : {
                    "allocations requiring file extension" : 0,
                    "blocks allocated" : 0,
                    "blocks freed" : 0,
                    "checkpoint size" : 249483755520,
                    "file allocation unit size" : 4096,
                    "file bytes available for reuse" : 61440,
                    "file magic number" : 120897,
                    "file major version number" : 1,
                    "file size in bytes" : 249483833344,
                    "minor version number" : 0
                },
                "btree" : {
                    "btree checkpoint generation" : 7,
                    "column-store fixed-size leaf pages" : 0,
                    "column-store internal pages" : 0,
                    "column-store variable-size RLE encoded values" : 0,
                    "column-store variable-size deleted values" : 0,
                    "column-store variable-size leaf pages" : 0,
                    "fixed-record size" : 0,
                    "maximum internal page key size" : 368,
                    "maximum internal page size" : 4096,
                    "maximum leaf page key size" : 2867,
                    "maximum leaf page size" : 32768,
                    "maximum leaf page value size" : 67108864,
                    "maximum tree depth" : 0,
                    "number of key/value pairs" : 0,
                    "overflow pages" : 0,
                    "pages rewritten by compaction" : 0,
                    "row-store internal pages" : 0,
                    "row-store leaf pages" : 0
                },
                "cache" : {
                    "bytes currently in the cache" : 33151,
                    "bytes read into cache" : 15496,
                    "bytes written from cache" : 0,
                    "checkpoint blocked page eviction" : 0,
                    "data source pages selected for eviction unable to be evicted" : 0,
                    "hazard pointer blocked page eviction" : 0,
                    "in-memory page passed criteria to be split" : 0,
                    "in-memory page splits" : 0,
                    "internal pages evicted" : 0,
                    "internal pages split during eviction" : 0,
                    "leaf pages split during eviction" : 0,
                    "modified pages evicted" : 0,
                    "overflow pages read into cache" : 0,
                    "overflow values cached in memory" : 0,
                    "page split during eviction deepened the tree" : 0,
                    "page written requiring lookaside records" : 0,
                    "pages read into cache" : 5,
                    "pages read into cache requiring lookaside entries" : 0,
                    "pages requested from the cache" : 4,
                    "pages written from cache" : 0,
                    "pages written requiring in-memory restoration" : 0,
                    "unmodified pages evicted" : 0
                },
                "cache_walk" : {
                    "Average difference between current eviction generation when the page was last considered" : 0,
                    "Average on-disk page image size seen" : 0,
                    "Clean pages currently in cache" : 0,
                    "Current eviction generation" : 0,
                    "Dirty pages currently in cache" : 0,
                    "Entries in the root page" : 0,
                    "Internal pages currently in cache" : 0,
                    "Leaf pages currently in cache" : 0,
                    "Maximum difference between current eviction generation when the page was last considered" : 0,
                    "Maximum page size seen" : 0,
                    "Minimum on-disk page image size seen" : 0,
                    "On-disk page image sizes smaller than a single allocation unit" : 0,
                    "Pages created in memory and never written" : 0,
                    "Pages currently queued for eviction" : 0,
                    "Pages that could not be queued for eviction" : 0,
                    "Refs skipped during cache traversal" : 0,
                    "Size of the root page" : 0,
                    "Total number of pages currently in cache" : 0
                },
                "compression" : {
                    "compressed pages read" : 1,
                    "compressed pages written" : 0,
                    "page written failed to compress" : 0,
                    "page written was too small to compress" : 0,
                    "raw compression call failed, additional data available" : 0,
                    "raw compression call failed, no additional data available" : 0,
                    "raw compression call succeeded" : 0
                },
                "cursor" : {
                    "bulk-loaded cursor-insert calls" : 0,
                    "create calls" : 1,
                    "cursor-insert key and value bytes inserted" : 0,
                    "cursor-remove key bytes removed" : 0,
                    "cursor-update value bytes updated" : 0,
                    "insert calls" : 0,
                    "next calls" : 0,
                    "prev calls" : 1,
                    "remove calls" : 0,
                    "reset calls" : 1,
                    "restarted searches" : 0,
                    "search calls" : 0,
                    "search near calls" : 0,
                    "truncate calls" : 0,
                    "update calls" : 0
                },
                "reconciliation" : {
                    "dictionary matches" : 0,
                    "fast-path pages deleted" : 0,
                    "internal page key bytes discarded using suffix compression" : 0,
                    "internal page multi-block writes" : 0,
                    "internal-page overflow keys" : 0,
                    "leaf page key bytes discarded using prefix compression" : 0,
                    "leaf page multi-block writes" : 0,
                    "leaf-page overflow keys" : 0,
                    "maximum blocks required for a page" : 0,
                    "overflow values written" : 0,
                    "page checksum matches" : 0,
                    "page reconciliation calls" : 0,
                    "page reconciliation calls for eviction" : 0,
                    "pages deleted" : 0
                },
                "session" : {
                    "object compaction" : 0,
                    "open cursor count" : 1
                },
                "transaction" : {
                    "update conflicts" : 0
                }
            },
            "nindexes" : 2,
            "totalIndexSize" : 35938430976,
            "indexSizes" : {
                "_id_" : 10082500608,
                "_id_hashed" : 25855930368
            },
            "ok" : 1
        },

In case anyone is encountering the same issue, we managed to resolve this by doing the following:

  1. Tweaking the chunk size within a shard (not to be confused with batch size)
  2. Increasing the initial number of chunks upon the creation of a collection

Example settings we used to diagnose if this was part of the issue: Chunk Size: 1024 mb (Maximum) Initial Chunks: 49152 (That is 8192 per shard which is the maximum limit)

Doing so meant that mongodb no longer had to split chunks continuously (which was blocking the ingestion calls).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM