
pg_dump and pg_restore on giant databases

I currently have a task to improve a database structure. For this we want to efficiently dump and restore one single giant database (approx. 1 TB and growing).

To test things with this database, we wanted to transfer it to another server node via pg_dump and pg_restore.

We are running a v10 server (https://www.postgresql.org/docs/10/app-pgdump.html), so we are limited to the parameters available in that version. It is also required to dump the full database, not only parts of it.

For this I tried a couple of approaches; these sources helped a lot:

and foremost:

The problem is that you can almost only improve one of these tasks, but not both simultaneously.

Case 1

Dumping in directory format is extremely fast (~1 hour), but restoring is not.

pg_dump --blobs --dbname="$DBNAME" --file=$DUMPDIR --format=directory --host=$SERVERHOSTNAME --jobs=$THREADS --port=$SERVERPORT --username="$SERVERUSERNAME"
pg_restore --clean --create --format=directory --jobs=$THREADS --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" "./"

The problem with this restore method is that, even though I assigned multiple cores to it, it only uses one, with barely 4% CPU used on the server core.
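For reference, pg_restore can only run parallel jobs against a direct database connection, so a -d/--dbname target is needed; with --create it should point at an existing maintenance database, from which the dumped database is then created. A minimal sketch, assuming the maintenance database is called postgres:

# Sketch only: parallel restore requires -d; "postgres" is an assumed maintenance database
pg_restore --clean --create --format=directory --jobs=$THREADS \
    --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" \
    --dbname=postgres "$DUMPDIR"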

Case 2

Dumping in custom format is extremely slow; the server couldn't even complete it overnight (session timeout).

pg_dump --blobs --compress=9 --dbname="$dbname" --file="$DUMPDIR/db.dump" --format=custom --host=$SERVERHOSTNAME --port=$SERVERPORT --username=$SERVERUSERNAME

So I had different approaches in mind:

  1. Dump it with approach #1, convert the dump afterwards (how?), and use a faster restore method (variant #2?)
  2. Create multiple dumps simultaneously on different cores, one per schema (there are 6 in total), and then merge them back together (how? a rough sketch follows below)
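A rough sketch of idea #2, with placeholder schema names; note that separate pg_dump runs do not share a snapshot, so the per-schema dumps are not guaranteed to be mutually consistent:

# Hypothetical schema names; each schema gets its own directory-format archive, dumped in parallel
for schema in schema_a schema_b schema_c; do
    pg_dump --format=directory --jobs=2 --schema="$schema" \
        --dbname="$DBNAME" --host=$SERVERHOSTNAME --port=$SERVERPORT \
        --username="$SERVERUSERNAME" --file="$DUMPDIR/$schema" &
done
wait
# "Merging" would then mean one pg_restore per archive into the same (pre-created) target database.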

Piping seems to be an inefficient way of dumping, according to the author mentioned above.

Does anyone have more experience with this? Are my ideas useful, or do you have a completely different solution in mind?

Oh, before I forget: we are currently limited to 5 TB on our external server, and the internal server which runs the database should not get bloated with data fragments, even temporarily.

A parallel pg_restore with the directory format should speed up processing.

If it doesn't, I suspect that much of the data is in one large table, which pg_restore (and pg_dump) cannot parallelize.
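One way to check that is to list the largest tables on the source database; the query below is only an illustration, run via psql with the question's connection variables:

# List the ten largest tables to see whether one of them dominates the 1 TB
psql -h $SERVERHOSTNAME -p $SERVERPORT -U "$SERVERUSERNAME" -d "$DBNAME" -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;"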

Make sure you disable compression (-Z 0) to improve the speed (unless you have a weak network).
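Combined with the directory format and parallel workers from case 1, the dump side would look roughly like this (a sketch reusing the question's variables):

# Directory-format dump, compression off (-Z 0), parallel workers
pg_dump --blobs --format=directory --jobs=$THREADS -Z 0 \
    --dbname="$DBNAME" --host=$SERVERHOSTNAME --port=$SERVERPORT \
    --username="$SERVERUSERNAME" --file=$DUMPDIR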

You might be considerably faster with an online file system backup:

  • pg_basebackup is simple, but cannot be parallelized (a minimal invocation is sketched at the end of this answer).

  • Using the low-level API, you can parallelize the backup with operating system or storage techniques.

The disadvantage is that with a file system backup, you can only copy the whole database cluster.
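A minimal pg_basebackup sketch; the target directory is an assumption and must be empty, and the connecting role needs replication privileges:

# Tar-format base backup with streamed WAL and a progress indicator
pg_basebackup --format=tar --wal-method=stream --progress \
    --host=$SERVERHOSTNAME --port=$SERVERPORT --username="$SERVERUSERNAME" \
    --pgdata="$BACKUPDIR"   # $BACKUPDIR is hypothetical, not from the question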


 