
Backup/Restore Collection in Parts

I am currently working on an AWS EC2 server and I scraped some data that I stored in a MongoDB collection. This is the only collection in my database.

Now I need to transfer this collection to my local machine to process it. My problem is that the remaining disk space on the remote machine is insufficient to dump the whole collection; there is room for only around 60% of it. I tried to use db.copy() and db.export() with the host name to copy directly to my local machine, but it doesn't work because I'm not on a local network and I have some authentication issues, even with an ssh tunnel.

What I would like to do is to split my big collection into 2 smaller collections and dump each of them. Is it possible?

Thank you!

Your best option is to use mongodump to take just parts of the collection. It's also the best tool for a "bulk migration" of data, so some of the usage shown here also applies to working directly between hosts, if you can change the networking setup between the hosts to allow this.

If you need to use mongodump on only part of the collection then the general approach is to apply the --query option to select your output. There is no "limit" modifier for the output, so instead you need to apply the "range query" operators, which are $lte and $gt respectively.

As a trivial example set, consider the following data:

{ "_id" : ObjectId("560e4a56a1a451fc8a37057f"), "list" : [ 1, 2, 3 ] }
{ "_id" : ObjectId("560e4a5ca1a451fc8a370580"), "list" : [ 1, 2 ] }
{ "_id" : ObjectId("560e4a62a1a451fc8a370581"), "list" : [ 1 ] }
{ "_id" : ObjectId("560e4a6ca1a451fc8a370582"), "list" : [ ] }

So the idea is to get the _id value at the "cut point(s)" that you want and construct range queries to select only the documents within those ranges. For this example we will just break the output into groups of two.

So the first thing you want is the _id of the second document ( since we are working in twos ), which you can retrieve by applying .skip() and .limit() within the mongo shell:

db.sample.find().sort({ "_id": 1 }).skip(1).limit(1)

That is just going to return the document:

{ "_id" : ObjectId("560e4a5ca1a451fc8a370580"), "list" : [ 1, 2 ] }

This works by skipping over n-1 documents, where n is the number of documents you want to export in this batch, and then outputting just the last document.
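The cut-off selection can be sketched without a live MongoDB instance. Assuming the collection's _id values are available in sorted order, each cut-off sits at zero-based index n-1, 2n-1, 3n-1, and so on ( a hypothetical helper for illustration, not part of the MongoDB tools ):

```python
# Illustrative sketch: given the sorted _id hex strings of a collection,
# return the last _id of each full batch of `batch_size` documents --
# the same documents .skip(n - 1).limit(1) style queries would find.
def cut_points(sorted_ids, batch_size):
    return [sorted_ids[i]
            for i in range(batch_size - 1, len(sorted_ids), batch_size)]

ids = [
    "560e4a56a1a451fc8a37057f",
    "560e4a5ca1a451fc8a370580",
    "560e4a62a1a451fc8a370581",
    "560e4a6ca1a451fc8a370582",
]
print(cut_points(ids, 2))
# ['560e4a5ca1a451fc8a370580', '560e4a6ca1a451fc8a370582']
```

If the collection size is not a multiple of the batch size, the remaining tail of documents simply becomes the final part, selected with $gt only.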

The issued mongodump command would then contain the range selector $lte to come up to just that point:

mongodump -d test -c sample \
--query '{ "_id": { "$lte": { "$oid": "560e4a5ca1a451fc8a370580" } } }' \
--out part1

Note the $oid within the query. The mongodump and mongoimport tools both use the "strict" form described in MongoDB Extended JSON. Helper constructors like ObjectId() available in the shell are not "strictly JSON", and tools like mongodump ( or anything with the --query option ) take plain JSON as input, so such data is instead represented in this form.

For your next part you want the next n documents that will fit in the dump. So you query for the next cut-off document by skipping the n documents already output plus the n-1 documents up to the next cut-off point, or basically ( 2 + 2 - 1 ) = 3 :

db.sample.find().sort({ "_id": 1 }).skip(3).limit(1)

Or even better, apply a range with $gt from the last cut-off you had:

db.sample.find({ "_id": { "$gt": ObjectId("560e4a5ca1a451fc8a370580") }}).skip(1).limit(1)

Either way gets you the next cut-off document:

{ "_id" : ObjectId("560e4a6ca1a451fc8a370582"), "list" : [ ] }
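The skip offset arithmetic above generalizes: after k completed batches of size n, the next cut-off document sits at skip( k*n + n - 1 ). A tiny sketch:

```python
# Zero-based skip offset for the cut-off document of batch k (0-based),
# with n documents per batch: skip the k*n documents already exported
# plus the n-1 documents before the cut-off itself.
def skip_for_cut(k, n):
    return k * n + (n - 1)

print(skip_for_cut(0, 2))  # 1 -> .skip(1).limit(1) for the first cut-off
print(skip_for_cut(1, 2))  # 3 -> .skip(3).limit(1) for the second cut-off
```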

Then apply another range query on the dump, but this time using "both" the $gt and $lte operators:

mongodump -d test -c sample \
--query '{ "_id": { 
    "$gt": { "$oid": "560e4a5ca1a451fc8a370580" },
    "$lte": { "$oid": "560e4a6ca1a451fc8a370582" } }}' \
--out part2
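Scripting this pattern is straightforward. A minimal Python sketch ( a hypothetical helper, not part of the MongoDB tools ) that turns the ordered list of cut-off _id hex strings into the strict-JSON --query strings for each part:

```python
import json

def range_queries(cuts):
    """Build the strict-JSON --query string for each part from the
    ordered list of cut-off _id hex strings: the first part uses only
    $lte, middle parts use $gt/$lte, and a trailing $gt part covers
    anything after the last cut-off."""
    queries, prev = [], None
    for cut in cuts:
        cond = {}
        if prev is not None:
            cond["$gt"] = {"$oid": prev}
        cond["$lte"] = {"$oid": cut}
        queries.append(json.dumps({"_id": cond}))
        prev = cut
    queries.append(json.dumps({"_id": {"$gt": {"$oid": prev}}}))
    return queries

for i, q in enumerate(range_queries(
        ["560e4a5ca1a451fc8a370580", "560e4a6ca1a451fc8a370582"]), 1):
    print(f"mongodump -d test -c sample --query '{q}' --out part{i}")
```

Note that when the last cut-off is also the final document in the collection, as in the worked example here, the trailing $gt part matches nothing and can simply be skipped.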

As with each part, you can take the data and move it over to the target host as you require. Note that in this form --out specifies a directory where the dump files are written.

Note that there are options that can help here as well, such as:

--host - ( ideally used with mongorestore ) allows you to run the whole process on another system. So for example you could run the following on your new target MongoDB instance to pipe data from the origin host directly into mongorestore on that system:

mongodump --host orighost -d test -c sample \
--query '{ "_id": { 
    "$gt": { "$oid": "560e4a5ca1a451fc8a370580" },
    "$lte": { "$oid": "560e4a6ca1a451fc8a370582" } }}' \
--out - \
| mongorestore -d newtest -c newsample --dir -

Note that - denotes standard output/input respectively for each command.

--gzip - If you have MongoDB 3.2 on both hosts, then you can also take advantage of this option to compress/decompress either the data output or the stream, as in the pipe above. Combined with that piping option it would be the most efficient way to migrate data to the new target host.

As for mongorestore in general, however you apply it the data will happily rebuild the collection, even in parts. The general behaviour is "insert only", so different restores will "add" to a collection but never "overwrite" data with the same _id value.
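That "insert only" merge behaviour can be sketched with plain dicts ( a simulation for illustration, not a real restore ): restoring a part into an existing collection adds documents with new _id values but never replaces an existing one.

```python
# Simulate mongorestore's "insert only" behaviour: documents whose _id
# already exists in the target collection are skipped, never overwritten.
def insert_only_restore(existing, incoming):
    merged = dict(existing)
    for _id, doc in incoming.items():
        if _id not in merged:  # duplicate _id values are ignored
            merged[_id] = doc
    return merged

current = {"a": {"list": [1]}}
part = {"a": {"list": [9, 9]}, "b": {"list": [1, 2]}}
print(insert_only_restore(current, part))
# existing "a" keeps its original document; only "b" is added
```

This is why restoring the parts one after another safely rebuilds the full collection, even if the ranges were to overlap slightly.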

Look at the options carefully. Especially if your host systems are both on EC2, or even both within general cloud resources, there really should be no reason why you could not pipe output from one directly to the other. All that would be required is, at most, a little firewall configuration between the allowed hosts.

But if at any rate you just want to back up "partial" data, then this is usually the way to go about doing it.

Of course, depending on your own setup and authentication needs, both commands are likely to require options other than those demonstrated here. The options here are just the "required" options to specify a "collection" from a "database" and filter with a "query".
