Storing datastream in hdf5 file using python

I have a Python program that accepts a stream of data via UDP at a rate of +- 1000 Hz. A typical stream takes +- 15 minutes. It consists of +- 10 channels, each carrying a stream of doubles, booleans or vectors of size 3, together with a timestamp.

Currently, every iteration (so 1000 times a second) it writes a line with all the values to a CSV file.

To limit the size of the files I want to change the format to HDF5 and write the data with h5py.

In short, it should look like this:

import threading


class StoreData(threading.Thread):

    def __init__(self):
        super().__init__()
        # placeholder: open the HDF5 file in write mode
        self.f = open_hdf5_file_as_write()

    def run(self):
        while True:
            # returns True every +- 0.001 seconds
            if self.new_values_available():
                vals = self.get_new_vals()
                # What to do best with the vals here?

But I have stumbled upon 2 questions.

  1. What is the best structure of the HDF5 file? Is it best to store the streams in different groups, or just as different datasets in the same group?

  2. How should I write the data? Do I expand the datasets by 1 value every iteration using a resize? Do I locally store data and update every n iterations with a chunk of n values per stream, or do I keep everything in a pandas table and write it just once at the end?

Answering 1 of the 2 questions would already be a big help!

Both are good questions. I can't give a precise answer without knowing more about your data and workflows. (Note: The HDF Group has a good overview you might want to review here: Introduction to HDF5. It is a good place to learn the possibilities of schema design.) Here are the things I would consider in a "thought experiment":

The best structure:
With HDF5, you can define any schema you want (within limits), so the best structure (schema) is the one that works best with your data and processes.

  • Since you have an existing CSV file format, the simplest approach is creating an equivalent NumPy dtype and referencing it to create a recarray that holds the data. This would mimic your current data organization (see the sketch after this list). If you want to get fancier, here are other considerations:
  • Your datatypes: are they homogeneous (all floats or all ints) or heterogeneous (a mix of floats, ints and strings)? You have more options if they are all the same. However, HDF5 also supports mixed types as compound data.
  • Organization: How are you going to use the data? A properly designed schema will help you avoid data gymnastics in the future. Is it advantageous (to you) to save everything in 1 dataset, or to distribute it across different datasets/groups? Think of data organized in folders and files on your computer: HDF5 groups are your folders and the datasets are your files.
  • Convenience of working with the data: similar to organization. How easy/hard is it to write vs read the data? It might be easier to write it as you get it - but is that a convenient format when you want to process it?
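As a minimal sketch of the recarray idea: a compound NumPy dtype that mirrors one CSV row could look like the code below. The channel names (position, pressure, valve_open) are hypothetical placeholders, not part of your actual stream:

import numpy as np

# One CSV row as a compound dtype -- the channel names are made up here.
row_dtype = np.dtype([
    ('timestamp',  'f8'),        # time of the sample
    ('position',   'f8', (3,)),  # a vector-of-3 channel
    ('pressure',   'f8'),        # a double channel
    ('valve_open', '?'),         # a boolean channel
])

# A structured array (recarray-style) holding ~1 second of data at 1000 Hz:
buf = np.zeros(1000, dtype=row_dtype)
buf[0] = (0.0012, (1.0, 2.0, 3.0), 101.3, True)

The same dtype can be passed to h5py or PyTables to create a compound (table-like) dataset.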

How should I write the data?
There are several Python packages that can write HDF5 data. I am familiar with PyTables (aka tables) and h5py. (Pandas can also create HDF5 files, but I have no experience to share.) Both packages have similar capabilities, and some differences. Both support the HDF5 features you need (resizeable datasets, homogeneous and/or heterogeneous data). h5py attempts to map the HDF5 feature set to NumPy as closely as possible. PyTables has an abstraction layer on top of HDF5 and NumPy, with advanced indexing capabilities to quickly perform in-kernel data queries. (Also, I found PyTables I/O is slightly faster than h5py.) For those reasons, I prefer PyTables, but I am equally comfortable with h5py.
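Here is a minimal h5py sketch of a resizeable, chunked dataset for this kind of stream. The file name 'stream.h5', the dataset name 'stream' and the dtype are assumptions, reusing the hypothetical row layout from the previous sketch:

import h5py
import numpy as np

row_dtype = np.dtype([('timestamp', 'f8'), ('position', 'f8', (3,)),
                      ('pressure', 'f8'), ('valve_open', '?')])

with h5py.File('stream.h5', 'w') as f:
    # chunks + maxshape=(None,) make the dataset resizeable, so rows
    # can be appended while the stream is running.
    dset = f.create_dataset(
        'stream',
        shape=(0,),          # start empty
        maxshape=(None,),    # unlimited number of rows
        chunks=(1000,),      # ~1 second of data per chunk at 1000 Hz
        dtype=row_dtype,
    )

PyTables can do the equivalent with tables.open_file() and create_table(), using the same NumPy dtype as the table description.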

How often should I write: every 1 or N iterations, or once at the end?
This is a trade-off of available RAM vs required I/O performance vs coding complexity. There is an I/O "time cost" with each write to the file. So, the fastest process is to save all data in RAM and write it at the end. That means you need enough memory to hold a 15 minute datastream. I suspect memory requirements will drive this decision. The good news: PyTables and h5py will support any of these methods.
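A sketch of the middle-ground option (buffer N rows in RAM, then append them in one resize + write), assuming the 'stream' dataset was created as in the previous sketch; fake_sample_stream() is a hypothetical stand-in for your UDP reader:

import h5py
import numpy as np

N = 1000  # flush roughly once per second at 1000 Hz

row_dtype = np.dtype([('timestamp', 'f8'), ('position', 'f8', (3,)),
                      ('pressure', 'f8'), ('valve_open', '?')])

with h5py.File('stream.h5', 'a') as f:
    dset = f['stream']                  # resizeable dataset created earlier
    buf = np.zeros(N, dtype=row_dtype)  # in-RAM buffer for N rows
    count = 0

    for vals in fake_sample_stream():   # hypothetical sample source
        buf[count] = vals
        count += 1
        if count == N:                  # buffer full: append it in one write
            start = dset.shape[0]
            dset.resize((start + N,))
            dset[start:start + N] = buf
            count = 0

    if count:                           # final partial buffer at end of stream
        start = dset.shape[0]
        dset.resize((start + count,))
        dset[start:start + count] = buf[:count]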
