Why are the netCDF4 file sizes so different depending on how the data is written?

I have several text files storing 2-dimensional data (all the same shape) for different times and different groups. I want to convert these data into ONE netCDF file with several netCDF groups. Each group's variable has the same dimensions: dimensions:{time=62, lat=118, lon=104}. I write the data in three ways. The code is written in Python 3.7 with the netCDF4 package.

from netCDF4 import Dataset, date2num, date2index
import numpy as np
import os
from datetime import datetime, timedelta


def initialize(fpath):
    rootgrp = Dataset(fpath, 'w')
    rootgrp.createDimension('time', 62)
    rootgrp.createDimension('lat', 118)
    rootgrp.createDimension('lon', 104)

    times = rootgrp.createVariable('time', 'f8', ('time', ))
    lats = rootgrp.createVariable('lat', 'f4', ('lat', ))
    lons = rootgrp.createVariable('lon', 'f4', ('lon', ))

    lats.units = 'degrees north'
    lons.units = 'degrees east'
    times.units = 'hours since 1900-01-01 00:00:00.0'
    times.calendar = 'gregorian'
    datetimes = [
        datetime(2020, 3, 1, 8) + n * timedelta(hours=12) for n in range(62)
    ]

    lats[:] = np.linspace(-40, 40, 118)
    lons[:] = np.linspace(80, 160, 104)
    times[:] = date2num(datetimes, times.units, times.calendar)
    return rootgrp


def write(fpath, data, **kwargs):
    if not os.path.exists(fpath):
        rootgrp = initialize(fpath)
    else:
        rootgrp = Dataset(fpath, 'r+')

    grppath = kwargs['grppath']
    varname = kwargs['varname']
    grp = rootgrp.createGroup(grppath)
    if varname in grp.variables:
        var = grp.variables[varname]
    else:
        # zlib=True enables compression; least_significant_digit quantizes
        # the data so it compresses better.
        var = grp.createVariable(varname,
                                 'f4', ('time', 'lat', 'lon'),
                                 zlib=True,
                                 least_significant_digit=1)

    times = rootgrp.variables['time']
    datetimes = kwargs.get('datetimes', None)
    if datetimes is None:
        time_index = slice(None)
    else:
        time_index = date2index(datetimes, times, calendar=times.calendar)

    print(var[time_index, :, :].shape)
    print(data.shape)
    var[time_index, :, :] = data
    rootgrp.close()


def get_data(groups, datetimes):
    shape = (118, 104)
    size = shape[0] * shape[1]
    all_group = {}
    for group in groups:
        data_list = []
        for time in datetimes:
            data = np.random.random(size).reshape(shape)
            data_list.append(data)
        all_group[group] = data_list
    return all_group


def way1(datetimes, grouped_data):
    for i, time in enumerate(datetimes):
        for group, data in grouped_data.items():
            write('way1.nc',
                  data[i],
                  grppath=group,
                  varname='random',
                  datetimes=time)


def way2(datetimes, grouped_data):
    for group in grouped_data:
        all_data = np.stack(grouped_data[group])
        write('way2.nc',
              all_data,
              grppath=group,
              varname='random',
              datetimes=datetimes)


def way3(datetimes, grouped_data):
    for group, data in grouped_data.items():
        for i, time in enumerate(datetimes):
            write('way3.nc',
                  data[i],
                  grppath=group,
                  varname='random',
                  datetimes=time)


groups = list('abcdefghijklmnopqrstuvwxyz')
datetimes = [
    datetime(2020, 3, 1, 8) + n * timedelta(hours=12) for n in range(62)
]
grouped_data = get_data(groups, datetimes)
way1(datetimes, grouped_data)
way2(datetimes, grouped_data)
way3(datetimes, grouped_data)

The files written in the three ways are all the same (every variable has ChunkSizes = (62, 118, 104)) except for the file size:

way 1: 495,324,392 bytes (503.3 MB on disk)

way 2: 15,608,108 bytes (16.7 MB on disk)

way 3: 15,608,108 bytes (16.7 MB on disk)

I'm wondering if anyone could explain this to me. Thanks!

Not a complete answer, but I have to go to sleep now and want to share what I've found so far. The h5ls output indeed showed that all datasets have the same size and chunking, so that's not the issue.
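
As an alternative to h5ls, here is a minimal sketch of how the chunking and compression settings can be inspected from Python with the netCDF4 API (using the file and variable names from the question):

from netCDF4 import Dataset

# Inspect one variable's storage layout; 'a/random' is the first
# group's variable created by the question's script.
with Dataset('way1.nc') as ds:
    var = ds['a/random']
    print(var.chunking())  # chunk sizes, e.g. [62, 118, 104]
    print(var.filters())   # compression settings (zlib, complevel, ...)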

In your program you test whether a netCDF file or variable exists, and only create it if it doesn't exist yet. However, you don't test for groups; you always create them. By changing grp = rootgrp.createGroup(grppath) into the following lines, the size of way1.nc is reduced to 19 MB:

if grppath in rootgrp.groups:
    grp = rootgrp[grppath]
else:
    grp = rootgrp.createGroup(grppath)

When you delete an object from an HDF5 file, the file size remains the same (see section 5.5.2, "Deleting a Dataset from a File and Reclaiming Space", of the HDF5 user guide). So I suspect that creating a group with the same name over and over allocates new storage space without freeing the old group's disk space, effectively leaking space inside the file. I don't know why this happens only in way 1 and not in way 3.
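
If you end up with such a bloated file, the unused space can usually be reclaimed by copying the contents into a fresh file, for example with the nccopy utility that ships with netCDF (a sketch, assuming nccopy is on your PATH; the output file name is mine):

import subprocess

# Rewrite way1.nc into a new file; space that is no longer
# referenced is simply not copied over.
subprocess.run(['nccopy', 'way1.nc', 'way1_repacked.nc'], check=True)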

Also, I don't yet understand why way1.nc is still slightly larger (19 MB) than the others (15 MB).

Finally, because you only call the initialize function when the netCDF file doesn't exist, you must remember to delete the output of a previous run before starting the program. This is easy to forget, so I recommend modifying your code so that initialize is always executed at program start, as sketched below.
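
A minimal sketch of that change (the helper name is mine): remove any leftover output first, so that initialize always runs on a fresh file.

import os

def start_fresh(fpath):
    # Delete the previous run's output, then create and initialize a new file.
    if os.path.exists(fpath):
        os.remove(fpath)
    initialize(fpath).close()  # initialize() as defined in the question

start_fresh('way1.nc')  # call once per output file before writing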
