ZFS SSD pool - Postgres setup really slow on write operations for large-table ALTER commands

Question

We recently moved a 1.2 TB database cluster from mirrored SSDs to a ZFS pool built from SSDs. After the move, I have seen a massive drop in performance on large write operations (ALTER TABLE type changes, vacuums, index creation, etc.).

To isolate the problem, I copied the 361 GB table, made sure no triggers were active, and ran the following command. The column's original type is timestamp.

ALTER TABLE table_log_test ALTER COLUMN date_executed TYPE timestamptz;

This takes about 3 hours to complete. That makes some sense, since it needs to touch every one of the 60 million rows; however, it takes about 10 minutes on the test system, which runs on plain SSD.
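
To put the two timings on a common scale, here is a rough back-of-the-envelope calculation. It assumes the rewrite moves about one table's worth of data (361 GB, treated as GiB for simplicity) and ignores WAL and index rebuilds, so the result is only an effective rate, not a measurement.

```shell
# Effective rewrite rate, using only the figures quoted above.
table_mib=$((361 * 1024))      # 361 GB table, treated as GiB (assumption)
zfs_secs=$((3 * 60 * 60))      # about 3 hours on the ZFS pool
ssd_secs=$((10 * 60))          # about 10 minutes on the plain-SSD test system

zfs_rate=$((table_mib / zfs_secs))
ssd_rate=$((table_mib / ssd_secs))

echo "ZFS pool:  ~${zfs_rate} MiB/s"
echo "plain SSD: ~${ssd_rate} MiB/s"
echo "slowdown:  ~$((zfs_secs / ssd_secs))x"
```

That works out to roughly 34 MiB/s on the pool against roughly 616 MiB/s on plain SSD, about an 18x gap.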

Comparing the zpool iostat output during the ALTER command with the output during a fio run, I get the following results.

ALTER command

pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.33T  5.65T  6.78K  5.71K  31.9M   191M
  raidz1    1.33T  5.65T  6.78K  5.71K  31.9M   191M
    sda         -      -  1.94K  1.34K  9.03M  48.6M
    sdb         -      -  1.62K  1.45K  7.66M  48.5M
    sdc         -      -  1.62K  1.46K  7.66M  48.3M
    sdd         -      -  1.60K  1.45K  7.59M  45.5M

fio

fio --ioengine=libaio --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw --bs=1MB --numjobs=32

pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.34T  5.65T     14  27.5K  59.8K   940M
  raidz1    1.34T  5.65T     14  27.5K  59.8K   940M
    sda         -      -      5  7.14K  23.9K   235M
    sdb         -      -      1  7.02K  7.97K   235M
    sdc         -      -      4  7.97K  19.9K   235M
    sdd         -      -      1  5.33K  7.97K   235M

So it seems to me that ZFS is working fine; it's just the interaction with PostgreSQL that's slow.

What settings have I played with?

ZFS

recordsize = 16KB, changed from 128KB
logbias = latency; throughput performed worse
compression = lz4
primarycache = all, as we have large writes and reads
No dedicated L2ARC or SLOG (ZIL) device configured
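
A quick way to confirm which of these values are actually in effect is to read them back. This is a sketch that assumes the Postgres data lives directly on the dataset named tank (the pool name shown in the zpool iostat output); substitute the real dataset if it is a child of the pool.

```shell
# Read back the properties listed above, plus sync, for the dataset.
zfs get recordsize,logbias,compression,primarycache,sync tank

# Show the pool layout; dedicated log (SLOG) or cache (L2ARC) devices
# would appear as separate "logs" and "cache" sections.
zpool status tank
```

Keep in mind that recordsize only applies to blocks written after the change, so files that existed before it keep their old record size until they are rewritten.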

Postgres settings

full_page_writes=off
shared_buffers = 12GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
random_page_cost = 1.2
effective_io_concurrency = 200
work_mem = 256MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

I also tried synchronous_commit = off, but didn't see any performance increase.

As a note: for synchronous_commit and full_page_writes I only did a Postgres config reload, as this is a production site. I see some people do restarts, while some documentation states that only a reload is required. After the reload, the new values show up in psql when I run SHOW on the setting.
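
Whether a parameter needs a reload or a restart can be checked in the pg_settings view rather than guessed. The sketch below assumes psql can connect using the default environment settings; shared_buffers is included only as a contrast case.

```shell
# context: 'postmaster' = restart required, 'sighup' = reload is enough,
#          'user' = can also be changed per session with SET.
# pending_restart: true if the config file was changed but the new value
#                  will not take effect until a restart.
psql -X -c "
SELECT name, setting, context, pending_restart
FROM pg_settings
WHERE name IN ('full_page_writes', 'synchronous_commit', 'shared_buffers');
"
```

full_page_writes has context sighup and synchronous_commit has context user, so a reload is sufficient for both; shared_buffers is a postmaster-level setting and does need a restart.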

At this point, I am a bit lost on what to try next. I am also unsure if the reload vs restart may be the reason I don't see the gains others are talking about.

As a side note, VACUUM FULL ANALYZE didn't help either, not that I expected it to on a freshly copied table.

Thanks in advance for your help.

UPDATE 1: I amended my fio commands as suggested by jjanes. Here are the outputs.

The first one is based on jjanes's suggestion.

fio --ioengine=psync --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --rw=rw --rwmixread=50 --bs=8KB 

fio: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
fio: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=91.6MiB/s,w=90.2MiB/s][r=11.7k,w=11.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=3406394: Tue Dec 28 08:11:06 2021
  read: IOPS=16.5k, BW=129MiB/s (135MB/s)(1292MiB/10001msec)
    clat (usec): min=2, max=15165, avg=25.87, stdev=120.57
     lat (usec): min=2, max=15165, avg=25.94, stdev=120.57
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    6], 60.00th=[    9],
     | 70.00th=[   43], 80.00th=[   48], 90.00th=[   57], 95.00th=[   68],
     | 99.00th=[  153], 99.50th=[  212], 99.90th=[  457], 99.95th=[  963],
     | 99.99th=[ 7504]
   bw (  KiB/s): min=49392, max=209248, per=99.76%, avg=131997.16, stdev=46361.80, samples=19
   iops        : min= 6174, max=26156, avg=16499.58, stdev=5795.23, samples=19
  write: IOPS=16.5k, BW=129MiB/s (135MB/s)(1291MiB/10001msec); 0 zone resets
    clat (usec): min=5, max=22574, avg=33.29, stdev=117.32
     lat (usec): min=5, max=22574, avg=33.40, stdev=117.32
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
     | 30.00th=[   10], 40.00th=[   11], 50.00th=[   13], 60.00th=[   14],
     | 70.00th=[   17], 80.00th=[   22], 90.00th=[  113], 95.00th=[  133],
     | 99.00th=[  235], 99.50th=[  474], 99.90th=[ 1369], 99.95th=[ 2073],
     | 99.99th=[ 3720]
   bw (  KiB/s): min=49632, max=205664, per=99.88%, avg=132066.26, stdev=46268.55, samples=19
   iops        : min= 6204, max=25708, avg=16508.00, stdev=5783.26, samples=19
  lat (usec)   : 4=16.07%, 10=30.97%, 20=23.77%, 50=15.29%, 100=7.37%
  lat (usec)   : 250=5.94%, 500=0.30%, 750=0.10%, 1000=0.07%
  lat (msec)   : 2=0.08%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=3.47%, sys=72.13%, ctx=19573, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=165413,165306,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1292MiB (1355MB), run=10001-10001msec
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1291MiB (1354MB), run=10001-10001msec

The second one is from https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781785284335/1/ch01lvl1sec14/checking-iops

fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=tank --bs=8k --iodepth=32 --size=10G --readwrite=rw --rwmixread=50

test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=158MiB/s,w=157MiB/s][r=20.3k,w=20.1k IOPS][eta 00m:00s] 
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3484893: Tue Dec 28 08:13:31 2021
  read: IOPS=17.7k, BW=138MiB/s (145MB/s)(5122MiB/36990msec)
    slat (usec): min=2, max=33046, avg=31.73, stdev=95.75
    clat (nsec): min=1691, max=34831k, avg=878259.94, stdev=868723.61
     lat (usec): min=6, max=34860, avg=910.14, stdev=883.09
    clat percentiles (usec):
     |  1.00th=[  306],  5.00th=[  515], 10.00th=[  545], 20.00th=[  586],
     | 30.00th=[  619], 40.00th=[  652], 50.00th=[  693], 60.00th=[  742],
     | 70.00th=[  807], 80.00th=[  955], 90.00th=[ 1385], 95.00th=[ 1827],
     | 99.00th=[ 2933], 99.50th=[ 3851], 99.90th=[14877], 99.95th=[17433],
     | 99.99th=[23725]
   bw (  KiB/s): min=48368, max=205616, per=100.00%, avg=142130.51, stdev=34694.67, samples=73
   iops        : min= 6046, max=25702, avg=17766.29, stdev=4336.81, samples=73
  write: IOPS=17.7k, BW=138MiB/s (145MB/s)(5118MiB/36990msec); 0 zone resets
    slat (usec): min=6, max=18233, avg=22.24, stdev=85.73
    clat (usec): min=6, max=34848, avg=871.98, stdev=867.03
     lat (usec): min=15, max=34866, avg=894.36, stdev=898.46
    clat percentiles (usec):
     |  1.00th=[  302],  5.00th=[  515], 10.00th=[  545], 20.00th=[  578],
     | 30.00th=[  611], 40.00th=[  644], 50.00th=[  685], 60.00th=[  734],
     | 70.00th=[  807], 80.00th=[  955], 90.00th=[ 1385], 95.00th=[ 1811],
     | 99.00th=[ 2868], 99.50th=[ 3687], 99.90th=[15008], 99.95th=[17695],
     | 99.99th=[23987]
   bw (  KiB/s): min=47648, max=204688, per=100.00%, avg=142024.70, stdev=34363.25, samples=73
   iops        : min= 5956, max=25586, avg=17753.07, stdev=4295.39, samples=73
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.16%, 500=3.61%, 750=58.52%, 1000=19.22%
  lat (msec)   : 2=14.79%, 4=3.25%, 10=0.25%, 20=0.19%, 50=0.02%
  cpu          : usr=4.36%, sys=85.41%, ctx=28323, majf=0, minf=447
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=655676,655044,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5122MiB (5371MB), run=36990-36990msec
  WRITE: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5118MiB (5366MB), run=36990-36990msec

Conclusions

So it turns out the major cause of the poor performance was write amplification. The post below has a good comment on this by Dunuin: https://www.linuxbabe.com/mail-server/setup-basic-postfix-mail-sever-ubuntu
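
As a rough illustration of why record size matters here, the sketch below uses a deliberately simplified model: because ZFS is copy-on-write, changing part of a record means writing the whole record again, so amplification is approximated as recordsize divided by the size of the write. It ignores compression, raidz parity and metadata, and the 4K write size is the one reported in this post for the ALTER command.

```shell
# Simplified model: amplification ~= recordsize / size of each small write.
amplification() {   # usage: amplification <recordsize_kib> <write_kib>
  echo $(( $1 / $2 ))
}

amp_128k=$(amplification 128 4)   # default 128K recordsize, 4K writes
amp_16k=$(amplification 16 4)     # recordsize=16K, 4K writes

echo "recordsize=128K: ${amp_128k}x"
echo "recordsize=16K:  ${amp_16k}x"
```

Under this model a 4K write costs about 32x at the default 128K recordsize and about 4x at 16K.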

In short:

4K writes were the primary writes for the ALTER command
Adding a dedicated SLOG helped
Adding a dedicated L2ARC helped
Moving the WAL files to separate pools helped
Changing the record size to 16 KB helped
Disabling sync writes on the WAL helped
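
For readers who want to try the same changes, here is a hedged sketch of what the device and sync steps might look like. The device paths and the walpool/pg_wal dataset name are hypothetical placeholders, not taken from the post, and the script only prints the commands unless APPLY=1 is set.

```shell
# Dry-run by default: commands are printed, not executed. Set APPLY=1 to run.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

POOL=tank                                   # pool name from the post
SLOG_DEV=/dev/disk/by-id/EXAMPLE-slog       # hypothetical device path
CACHE_DEV=/dev/disk/by-id/EXAMPLE-cache     # hypothetical device path
WAL_DATASET=walpool/pg_wal                  # hypothetical dataset holding the WAL

run zpool add "$POOL" log "$SLOG_DEV"       # dedicated SLOG for synchronous writes
run zpool add "$POOL" cache "$CACHE_DEV"    # dedicated L2ARC read cache
run zfs set sync=disabled "$WAL_DATASET"    # stop honouring sync requests on the WAL
```

Note that sync=disabled makes ZFS acknowledge synchronous writes before they reach stable storage, so a crash or power loss can lose recently committed transactions. Treat it as a trade-off rather than a free speed-up.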

One thing I didn't try was recompiling Postgres with 32 KB pages. Based on what I have seen, this could have a significant performance impact and is worth investigating if you are installing a new cluster.

Thanks to everyone for their input into this problem. Hope this info helps someone else.

Answer 1

I wonder how you created the ZFS pool. I once forgot the ashift=12 option when creating a pool.

Maybe check this option with zdb (https://charsiurice.wordpress.com/2016/05/30/checking-ashift-on-existing-pools/).
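
A minimal sketch of that check, assuming the pool is named tank as in the question:

```shell
# Print the cached pool configuration and pick out the ashift of each vdev.
# ashift is the base-2 log of the sector size ZFS uses for the vdev:
#   ashift=9  -> 512-byte sectors
#   ashift=12 -> 4096-byte (4K) sectors
zdb -C tank | grep ashift
```

ashift is fixed when a vdev is created, so if the value turns out to be too small for the SSDs, the usual fix is to recreate the pool rather than change a property.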
