
MongoDB Bulk Insert where many documents already exist

I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert in MongoDB. But many of them (perhaps all, but typically 80% or so) will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days. So most of the events are already in there.

Anybody know (or want to guess) if it would be more efficient to:

  1. Do the bulk insert but with continueOnError = true, e.g.

db.collection.insert(myArray, {continueOnError: true}, callback)

  2. Do individual inserts, checking first if the _id exists?

  3. First do a big remove (something like db.collection.remove({_id: {$in: [array of all the IDs in my new documents]}})), then a bulk insert?

I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but what if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver, if that matters. My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking - will the "continueOnError" option be time consuming?

ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (Well, maybe it is, but that's another issue.)

What's happening is that a few new documents will be added.

My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking - will the "continueOnError" option be time consuming?

The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.

In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better off doing individual inserts.
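For illustration, here is a minimal sketch of that trade-off with the node.js driver (assuming a 1.x-era driver and an already-open collection; myArray is the array from the question):

collection.insert(myArray, {continueOnError: true, safe: true}, function(err, result) {
    // the whole batch is attempted, but on MongoDB 2.4 only the *last*
    // error encountered is reported back - with many duplicates you
    // can't tell which documents actually failed
    // code 11000 is a duplicate key error, which is expected here
    if (err && err.code !== 11000) {
        throw err;
    }
});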

The main time savings for bulk insert vs single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48Mb).

Are bulk inserts appropriate for this use case?

Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.

I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.

I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.

MongoDB 2.6

As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.

The new write commands will require driver changes to support, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back listing the 80% of your batch IDs that are duplicate keys.
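Since these features are still planned, the details may change, but a rough sketch of what that could look like with the 2.6 Bulk API in the mongo shell (the "events" collection name is hypothetical; myArray is the array from the question):

var bulk = db.events.initializeUnorderedBulkOp();
myArray.forEach(function(doc) { bulk.insert(doc); });
try {
    // with an unordered bulk op, duplicate keys no longer abort the batch
    var result = bulk.execute();
    print(result.nInserted + " new documents inserted");
} catch (e) {
    // expect one write error per duplicate _id rather than a single
    // batch-level error; printjson(e) shows the per-document details
    printjson(e);
}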

For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.

collection.insert(item, {continueOnError: true, safe: true}, function(err, result) {
    // code 11000 is a duplicate key error; ignore it, rethrow anything else
    if (err && err.code !== 11000) {
        throw err;
    }

    db.close();
    callBack();
});

For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than to do individual updates to the database for each document (with some large percentage apparently failing to update).
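A minimal sketch of that approach with the node.js driver (assuming an already-open collection; newDocs is a hypothetical name for the array of candidate documents):

var ids = newDocs.map(function(doc) { return doc._id; });

// one query to find which _ids are already present
collection.find({_id: {$in: ids}}).toArray(function(err, existing) {
    if (err) throw err;

    // assumes _id values stringify uniquely, as ObjectIds do
    var seen = {};
    existing.forEach(function(doc) { seen[String(doc._id)] = true; });

    // send only the documents that aren't in the collection yet
    var toInsert = newDocs.filter(function(doc) { return !seen[String(doc._id)]; });

    if (toInsert.length > 0) {
        collection.insert(toInsert, {safe: true}, function(err) {
            if (err) throw err;
        });
    }
});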

I wouldn't use the continueOnError and send all documents ... it's less efficient.

I'd vouch for using an upsert to let mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:

From the documentation:

upsert - Optional parameter; if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.

multi - Optional parameter; if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.

db.collection.update(
   <query>,
   <update>,
   { upsert: <boolean>, multi: <boolean> }
)

Here is the referenced documentation: http://docs.mongodb.org/manual/reference/method/db.collection.update/
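For example, a per-document upsert in the mongo shell might look like this (the collection and field names here are hypothetical):

// if no event with this _id exists it is inserted;
// otherwise the matching document is updated in place
db.events.update(
   { _id: "2013-11-02-some-event" },
   { $set: { title: "Some Event", start: ISODate("2013-11-02T19:00:00Z") } },
   { upsert: true }
)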
