We recently moved a 1.2 TB database cluster from mirrored SSDs to a ZFS pool built out of SSDs. Since the move I have seen a massive drop in performance on large write operations (ALTER TABLE type changes, vacuums, index creation, etc.).
To isolate the problem I copied the 361 GB table, made sure no triggers were active, and ran the following command (the original column type is timestamp):
ALTER TABLE table_log_test ALTER COLUMN date_executed TYPE timestamptz;
This takes about 3 hours to complete, which makes some sense since it has to rewrite every one of the 60 million rows; however, the same statement takes about 10 minutes on the test system, which runs on plain SSDs.
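For reference, the isolation test boils down to something like this (the database name and the copy method are assumptions; the table and column names are the ones used above):

# rough sketch of the isolation test; "mydb" is a placeholder database name
psql -d mydb -c "CREATE TABLE table_log_test AS TABLE table_log;"
time psql -d mydb -c "ALTER TABLE table_log_test ALTER COLUMN date_executed TYPE timestamptz;"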
Comparing the zpool iostat output during the ALTER command with the output during a fio run, I get the following results.
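The per-vdev numbers below come from zpool iostat; something along these lines, left running while each workload executes, produces the same view (the 5-second sampling interval is just an example):

# watch per-vdev operations and bandwidth while the workload runs
zpool iostat -v tank 5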
Alter command
                capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank 1.33T 5.65T 6.78K 5.71K 31.9M 191M
raidz1 1.33T 5.65T 6.78K 5.71K 31.9M 191M
sda - - 1.94K 1.34K 9.03M 48.6M
sdb - - 1.62K 1.45K 7.66M 48.5M
sdc - - 1.62K 1.46K 7.66M 48.3M
sdd - - 1.60K 1.45K 7.59M 45.5M
FIO
fio --ioengine=libaio --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw --bs=1MB --numjobs=32
                capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank 1.34T 5.65T 14 27.5K 59.8K 940M
raidz1 1.34T 5.65T 14 27.5K 59.8K 940M
sda - - 5 7.14K 23.9K 235M
sdb - - 1 7.02K 7.97K 235M
sdc - - 4 7.97K 19.9K 235M
sdd - - 1 5.33K 7.97K 235M
So it seems to me that ZFS itself is working fine; it's the interaction with PostgreSQL that's slow.
Settings I have played with
ZFS (example commands for these follow the list)
recordsize = 16KB changed from 128KB
logbias = latency (throughput performed worse)
compression = lz4
primarycache = all (we have large writes and reads)
no separate L2ARC or SLOG (ZIL) device
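For completeness, these properties would be applied with commands along these lines; the dataset name tank/pgdata is an assumption, so substitute whatever dataset actually holds the PostgreSQL data directory:

# dataset name is assumed; adjust to the dataset that holds PGDATA
zfs set recordsize=16K tank/pgdata
zfs set logbias=latency tank/pgdata
zfs set compression=lz4 tank/pgdata
zfs set primarycache=all tank/pgdata
zfs get recordsize,logbias,compression,primarycache tank/pgdata   # verify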
Postgres settings (a reload sketch follows the list)
full_page_writes=off
shared_buffers = 12GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
random_page_cost = 1.2
effective_io_concurrency = 200
work_mem = 256MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
and I also tried
synchronous_commit = off (didn't see any performance increase)
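For reference, a change like synchronous_commit can be applied and picked up without a restart roughly like this (ALTER SYSTEM is just one way to do it; editing postgresql.conf and reloading amounts to the same thing):

psql -c "ALTER SYSTEM SET synchronous_commit = off;"
psql -c "SELECT pg_reload_conf();"
psql -c "SHOW synchronous_commit;"   # confirm the running value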
As a note, for synchronous_commit and full_page_writes I only did a Postgres config reload, as this is a production site. I have seen some people do restarts, while some documentation states a reload is enough. The reloaded values do show up when I run SHOW in psql.
At this point I am a bit lost on what to try next. I am also unsure whether the reload-vs-restart question is the reason I don't see the gains others are talking about.
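One way to settle the reload-vs-restart question is to look at the context column in pg_settings: settings marked sighup take effect on a reload, postmaster settings need a restart, and user/superuser settings can even be changed per session. For example:

psql -c "SELECT name, setting, context FROM pg_settings
         WHERE name IN ('full_page_writes', 'synchronous_commit');"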
As a side note, VACUUM FULL ANALYZE didn't help either, not that I expected it to on a freshly copied table.
Thanks in advance for your help
UPDATE 1: I amended my fio commands as suggested by jjanes; here are the outputs.
The first one is based on jjanes's suggestion:
fio --ioengine=psync --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --rw=rw --rwmixread=50 --bs=8KB
fio: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
fio: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=91.6MiB/s,w=90.2MiB/s][r=11.7k,w=11.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=3406394: Tue Dec 28 08:11:06 2021
read: IOPS=16.5k, BW=129MiB/s (135MB/s)(1292MiB/10001msec)
clat (usec): min=2, max=15165, avg=25.87, stdev=120.57
lat (usec): min=2, max=15165, avg=25.94, stdev=120.57
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 6], 60.00th=[ 9],
| 70.00th=[ 43], 80.00th=[ 48], 90.00th=[ 57], 95.00th=[ 68],
| 99.00th=[ 153], 99.50th=[ 212], 99.90th=[ 457], 99.95th=[ 963],
| 99.99th=[ 7504]
bw ( KiB/s): min=49392, max=209248, per=99.76%, avg=131997.16, stdev=46361.80, samples=19
iops : min= 6174, max=26156, avg=16499.58, stdev=5795.23, samples=19
write: IOPS=16.5k, BW=129MiB/s (135MB/s)(1291MiB/10001msec); 0 zone resets
clat (usec): min=5, max=22574, avg=33.29, stdev=117.32
lat (usec): min=5, max=22574, avg=33.40, stdev=117.32
clat percentiles (usec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 13], 60.00th=[ 14],
| 70.00th=[ 17], 80.00th=[ 22], 90.00th=[ 113], 95.00th=[ 133],
| 99.00th=[ 235], 99.50th=[ 474], 99.90th=[ 1369], 99.95th=[ 2073],
| 99.99th=[ 3720]
bw ( KiB/s): min=49632, max=205664, per=99.88%, avg=132066.26, stdev=46268.55, samples=19
iops : min= 6204, max=25708, avg=16508.00, stdev=5783.26, samples=19
lat (usec) : 4=16.07%, 10=30.97%, 20=23.77%, 50=15.29%, 100=7.37%
lat (usec) : 250=5.94%, 500=0.30%, 750=0.10%, 1000=0.07%
lat (msec) : 2=0.08%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=3.47%, sys=72.13%, ctx=19573, majf=0, minf=28
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=165413,165306,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1292MiB (1355MB), run=10001-10001msec
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1291MiB (1354MB), run=10001-10001msec
The second one is from https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781785284335/1/ch01lvl1sec14/checking-iops
fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=tank --bs=8k --iodepth=32 --size=10G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=158MiB/s,w=157MiB/s][r=20.3k,w=20.1k IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3484893: Tue Dec 28 08:13:31 2021
read: IOPS=17.7k, BW=138MiB/s (145MB/s)(5122MiB/36990msec)
slat (usec): min=2, max=33046, avg=31.73, stdev=95.75
clat (nsec): min=1691, max=34831k, avg=878259.94, stdev=868723.61
lat (usec): min=6, max=34860, avg=910.14, stdev=883.09
clat percentiles (usec):
| 1.00th=[ 306], 5.00th=[ 515], 10.00th=[ 545], 20.00th=[ 586],
| 30.00th=[ 619], 40.00th=[ 652], 50.00th=[ 693], 60.00th=[ 742],
| 70.00th=[ 807], 80.00th=[ 955], 90.00th=[ 1385], 95.00th=[ 1827],
| 99.00th=[ 2933], 99.50th=[ 3851], 99.90th=[14877], 99.95th=[17433],
| 99.99th=[23725]
bw ( KiB/s): min=48368, max=205616, per=100.00%, avg=142130.51, stdev=34694.67, samples=73
iops : min= 6046, max=25702, avg=17766.29, stdev=4336.81, samples=73
write: IOPS=17.7k, BW=138MiB/s (145MB/s)(5118MiB/36990msec); 0 zone resets
slat (usec): min=6, max=18233, avg=22.24, stdev=85.73
clat (usec): min=6, max=34848, avg=871.98, stdev=867.03
lat (usec): min=15, max=34866, avg=894.36, stdev=898.46
clat percentiles (usec):
| 1.00th=[ 302], 5.00th=[ 515], 10.00th=[ 545], 20.00th=[ 578],
| 30.00th=[ 611], 40.00th=[ 644], 50.00th=[ 685], 60.00th=[ 734],
| 70.00th=[ 807], 80.00th=[ 955], 90.00th=[ 1385], 95.00th=[ 1811],
| 99.00th=[ 2868], 99.50th=[ 3687], 99.90th=[15008], 99.95th=[17695],
| 99.99th=[23987]
bw ( KiB/s): min=47648, max=204688, per=100.00%, avg=142024.70, stdev=34363.25, samples=73
iops : min= 5956, max=25586, avg=17753.07, stdev=4295.39, samples=73
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.16%, 500=3.61%, 750=58.52%, 1000=19.22%
lat (msec) : 2=14.79%, 4=3.25%, 10=0.25%, 20=0.19%, 50=0.02%
cpu : usr=4.36%, sys=85.41%, ctx=28323, majf=0, minf=447
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=655676,655044,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5122MiB (5371MB), run=36990-36990msec
WRITE: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5118MiB (5366MB), run=36990-36990msec
Conclusions
So it turns out the major issue behind the poor performance was write amplification. The post below has a good comment on this by Dunuin: https://www.linuxbabe.com/mail-server/setup-basic-postfix-mail-sever-ubuntu
In short: every 8 kB PostgreSQL page write ends up as a considerably larger physical write once ZFS recordsize, raidz1 parity and metadata overhead are accounted for, which multiplies the I/O for row-by-row rewrites like the ALTER TABLE above.
One thing I didn't try was recompiling PostgreSQL with 32 kB pages. Based on what I have seen, this could have a significant performance impact and is worth investigating if you are installing a new cluster.
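For anyone who wants to experiment with this on a fresh cluster, the page size is a compile-time option; a rough, untested sketch:

# untested sketch: build PostgreSQL with 32 kB data pages (and, optionally, 32 kB WAL blocks)
./configure --with-blocksize=32 --with-wal-blocksize=32
make
make install
# a cluster created by this build uses 32 kB pages; an existing cluster cannot be converted in place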
Thanks to everyone for their input on this problem. I hope this info helps someone else.
I wonder how you created the zfs pool. I once forgot the ashift=12 option when creating a zfs pool.
Maybe check this option with zdb ( https://charsiurice.wordpress.com/2016/05/30/checking-ashift-on-existing-pools/ ).
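For example, something along these lines shows the ashift of an existing pool (the pool name tank is assumed):

zdb -C tank | grep ashift   # a plain "zdb | grep ashift" also works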