
Why are the netCDF4 file sizes so different depending on how the data is written?

I have several text files storing 2-dimensional data (same shape) for different times and different groups. Now I want to convert these data into ONE netCDF file containing several netCDF groups. Each group's variable has the same dimensions, e.g. dimensions:{time=62, lat=118, lon=104}. I write the data in three ways. The code is written in Python 3.7 with the netCDF4 package.

from netCDF4 import Dataset, date2num, date2index
import numpy as np
import os
from datetime import datetime, timedelta


def initialize(fpath):
    rootgrp = Dataset(fpath, 'w')
    rootgrp.createDimension('time', 62)
    rootgrp.createDimension('lat', 118)
    rootgrp.createDimension('lon', 104)

    times = rootgrp.createVariable('time', 'f8', ('time', ))
    lats = rootgrp.createVariable('lat', 'f4', ('lat', ))
    lons = rootgrp.createVariable('lon', 'f4', ('lon', ))

    lats.units = 'degrees north'
    lons.units = 'degrees east'
    times.units = 'hours since 1900-01-01 00:00:00.0'
    times.calendar = 'gregorian'
    datetimes = [
        datetime(2020, 3, 1, 8) + n * timedelta(hours=12) for n in range(62)
    ]

    lats[:] = np.linspace(-40, 40, 118)
    lons[:] = np.linspace(80, 160, 104)
    times[:] = date2num(datetimes, times.units, times.calendar)
    return rootgrp


def write(fpath, data, **kwargs):
    if not os.path.exists(fpath):
        rootgrp = initialize(fpath)
    else:
        rootgrp = Dataset(fpath, 'r+')

    grppath = kwargs['grppath']
    varname = kwargs['varname']
    grp = rootgrp.createGroup(grppath)
    if varname in grp.variables:
        var = grp.variables[varname]
    else:
        var = grp.createVariable(varname,
                                 'f4', ('time', 'lat', 'lon'),
                                 zlib=True,
                                 least_significant_digit=1)

    times = rootgrp.variables['time']
    datetimes = kwargs.get('datetimes', None)
    if datetimes is None:
        time_index = slice(None)
    else:
        time_index = date2index(datetimes, times, calendar=times.calendar)

    print(var[time_index, :, :].shape)
    print(data.shape)
    var[time_index, :, :] = data
    rootgrp.close()


def get_data(groups, datetimes):
    shape = (118, 104)
    size = shape[0] * shape[1]
    all_group = {}
    for group in groups:
        data_list = []
        for time in datetimes:
            data = np.random.random(size).reshape(shape)
            data_list.append(data)
        all_group[group] = data_list
    return all_group


def way1(datetimes, grouped_data):
    for i, time in enumerate(datetimes):
        for group, data in grouped_data.items():
            write('way1.nc',
                  data[i],
                  grppath=group,
                  varname='random',
                  datetimes=time)


def way2(datetimes, grouped_data):
    for group in grouped_data:
        all_data = np.stack(grouped_data[group])
        write('way2.nc',
              all_data,
              grppath=group,
              varname='random',
              datetimes=datetimes)


def way3(datetimes, grouped_data):
    for group, data in grouped_data.items():
        for i, time in enumerate(datetimes):
            write('way3.nc',
                  data[i],
                  grppath=group,
                  varname='random',
                  datetimes=time)


groups = list('abcdefghijklmnopqrstuvwxyz')
datetimes = [
    datetime(2020, 3, 1, 8) + n * timedelta(hours=12) for n in range(62)
]
grouped_data = get_data(groups, datetimes)
way1(datetimes, grouped_data)
way2(datetimes, grouped_data)
way3(datetimes, grouped_data)

The files written in the three ways are identical (each variable's ChunkSizes is (62, 118, 104)) except for the file size; the chunk and filter settings can be double-checked with the snippet after the size list below.

way 1: 495,324,392 bytes (503.3 MB on disk)

way 2: 15,608,108 bytes (16.7 MB on disk)

way 3: 15,608,108 bytes (16.7 MB on disk)
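
A minimal sketch of that check, using only the netCDF4 package and the file, group and variable names from the script above:

from netCDF4 import Dataset

# Print the chunk sizes and compression filters of the 'random' variable
# in every group, to confirm the three files share the same layout.
for fname in ('way1.nc', 'way2.nc', 'way3.nc'):
    with Dataset(fname) as ds:
        for grpname, grp in ds.groups.items():
            var = grp.variables['random']
            print(fname, grpname, var.chunking(), var.filters())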

I'm wondering if anyone could explain this to me. Thanks!

Not a complete answer, but I have to go to sleep now and want to share what I have found so far. The h5ls output indeed showed that all datasets have the same size and chunks, so that's not the issue.

In your program you test whether the netCDF file or the variable already exists, and only create it if it doesn't exist yet. However, you don't test for groups; you always create them. By changing grp = rootgrp.createGroup(grppath) into the following lines, the size of way1.nc is reduced to 19 MB.

if grppath in rootgrp.groups:
    grp = rootgrp[grppath]
else:
    grp = rootgrp.createGroup(grppath)

When you delete an object from an HDF5 file, the file size remains the same (see section 5.5.2, "Deleting a Dataset from a File and Reclaiming Space", of the HDF5 user guide). So I suspect that creating a group with the same name over and over again allocates storage space without freeing the old group's disk space, effectively leaking file space. I don't know why this happens only in way 1 and not in way 3.
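
If that suspicion is right, the unreachable space should disappear when the file is repacked into a fresh copy. A quick check (assuming the nccopy utility shipped with the netCDF C library is on your PATH; the output file name is made up for this example):

import subprocess

# Copying the file through the netCDF API only carries over objects that are
# still reachable, so if unreferenced group data is the culprit, the repacked
# copy of way1.nc should shrink back to roughly the size of way2.nc / way3.nc.
subprocess.run(['nccopy', 'way1.nc', 'way1_repacked.nc'], check=True)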

Also, I don't yet understand why way1.nc is still slightly larger (19 MB) than the others (15 MB).

Finally, because you only call the initialize function when the netCDF file doesn't exist, you must be careful to delete the output of the previous run before starting the program. That is easy to forget, so I recommend restructuring the code so that initialize is always executed at program start, for example along the lines of the sketch below.
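
A minimal sketch of such a restructuring; it reuses the initialize, write, get_data, groups and datetimes defined in the question, and the output file name is just an example:

import os
import numpy as np

fpath = 'way2_clean.nc'

# Always start from a fresh file so initialize() runs on every program start.
if os.path.exists(fpath):
    os.remove(fpath)
initialize(fpath).close()

grouped_data = get_data(groups, datetimes)
for group in grouped_data:
    write(fpath,
          np.stack(grouped_data[group]),  # write all 62 time steps at once, as in way 2
          grppath=group,
          varname='random',
          datetimes=datetimes)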
