Storing a datastream in an HDF5 file using Python
I have a Python program that receives a stream of data via UDP at a rate of roughly 1000 Hz. A typical stream lasts about 15 minutes. It consists of around 10 channels, each carrying a stream of doubles, booleans, or vectors of size 3, together with a timestamp.
Currently, on every iteration (so 1000 times a second) it writes a line with all the values to a CSV file.
To limit the size of the files I want to change the format to HDF5 and write the data with h5py.
Very briefly, it should look like this:
class StoreData(threading.Thread):
    def __init__(self):
        super().__init__()
        self.f = open_hdf5_file_as_write()

    def run(self):
        while True:
            # returns True roughly every 0.001 seconds
            if self.new_values_available():
                vals = self.get_new_vals()
                # What is the best thing to do with vals here?
But I'm stuck on two questions.
What is the best structure for the HDF5 file? Is it better to store the streams in different groups, or as different datasets in the same group?
How should I write the data? Should I expand each dataset by one value every iteration using a resize? Should I buffer the data locally and update every n iterations with a chunk of n values per stream, or keep everything in a pandas table and write it just once at the end?
Answering either of the two questions would already be a big help!
Both are good questions. I can't give a precise answer without knowing more about your data and workflows. (Note: The HDF Group has a good overview you might want to review: Introduction to HDF5. It is a good place to learn the possibilities of schema design.) Here are the things I would consider in a "thought experiment":
The best structure:
With HDF5, you can define any schema you want (within limits), so the best structure (schema) is the one that works best with your data and processes.
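For a setup like this, one minimal h5py sketch of a possible schema is a single group with one resizable dataset per channel. The channel names, dtypes, and chunk sizes below are illustrative assumptions, not something from the question:

```python
import h5py

# Hypothetical schema: one group, one resizable dataset per channel.
with h5py.File("stream.h5", "w") as f:
    grp = f.create_group("channels")
    # Scalar double channel: rows of (timestamp, value) in 2 columns.
    grp.create_dataset("force", shape=(0, 2), maxshape=(None, 2),
                       dtype="f8", chunks=(1000, 2))
    # Vector-of-3 channel: timestamp plus 3 components -> 4 columns.
    grp.create_dataset("position", shape=(0, 4), maxshape=(None, 4),
                       dtype="f8", chunks=(1000, 4))
    # Boolean channel: a dataset must have one homogeneous dtype, so
    # keep the f8 timestamps and u1 values in two parallel datasets.
    grp.create_dataset("trigger_time", shape=(0,), maxshape=(None,),
                       dtype="f8", chunks=(1000,))
    grp.create_dataset("trigger_value", shape=(0,), maxshape=(None,),
                       dtype="u1", chunks=(1000,))
```

Whether the channels sit in one group or several is mostly an organizational choice; access speed is driven far more by chunking and write pattern than by group layout.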
How should I write the data?
There are several Python packages that can write HDF5 data. I am familiar with PyTables (aka tables) and h5py. (Pandas can also create HDF5 files, but I have no experience to share.) Both packages have similar capabilities, with some differences. Both support the HDF5 features you need (resizable datasets, homogeneous and/or heterogeneous data). h5py attempts to map the HDF5 feature set to NumPy as closely as possible. PyTables has an abstraction layer on top of HDF5 and NumPy, with advanced indexing capabilities to quickly perform in-kernel data queries. (Also, I found PyTables I/O to be slightly faster than h5py.) For those reasons I prefer PyTables, but I am equally comfortable with h5py.
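As a rough illustration of the PyTables style, here is a sketch that stores all channels as one heterogeneous table with a timestamp plus per-channel columns. The column names and the few fake samples are assumptions for demonstration only:

```python
import numpy as np
import tables

# Hypothetical row layout: one heterogeneous record per sample.
class Sample(tables.IsDescription):
    timestamp = tables.Float64Col()
    force = tables.Float64Col()
    position = tables.Float64Col(shape=(3,))  # vector-of-3 channel
    trigger = tables.BoolCol()

with tables.open_file("stream_pt.h5", "w") as f:
    table = f.create_table("/", "samples", Sample, "UDP stream samples")
    row = table.row
    for i in range(5):  # a few fake samples standing in for the UDP feed
        row["timestamp"] = i * 0.001
        row["force"] = 1.0
        row["position"] = np.zeros(3)
        row["trigger"] = False
        row.append()
    table.flush()
```

A table like this also lets you use PyTables' in-kernel queries (e.g. `table.read_where(...)`) later, which is harder to get with plain per-channel arrays.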
How often should I write: every iteration, every N iterations, or once at the end?
This is a trade-off between available RAM, required I/O performance, and coding complexity. There is an I/O "time cost" with each write to the file. So the fastest process is to save all data in RAM and write it once at the end, but that means you need enough memory to hold a 15-minute datastream. I suspect memory requirements will drive this decision. The good news: PyTables and h5py support any of these methods.
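The middle-ground option (buffer N rows in RAM, then grow the dataset and write the whole block at once) could be sketched in h5py roughly like this. The buffer size, file name, and the fake acquisition loop are assumptions standing in for your `get_new_vals()`:

```python
import h5py
import numpy as np

BUFFER_ROWS = 1000  # ~1 second of data at 1000 Hz (illustrative choice)

with h5py.File("buffered.h5", "w") as f:
    # Resizable (timestamp, value) dataset; chunk size matches the buffer.
    dset = f.create_dataset("force", shape=(0, 2), maxshape=(None, 2),
                            dtype="f8", chunks=(BUFFER_ROWS, 2))
    buffer = np.empty((BUFFER_ROWS, 2))
    count = 0

    def flush(n):
        """Append the first n buffered rows to the end of the dataset."""
        start = dset.shape[0]
        dset.resize(start + n, axis=0)
        dset[start:start + n] = buffer[:n]

    # Fake acquisition loop standing in for the real UDP callback.
    for i in range(2500):
        buffer[count] = (i * 0.001, np.sin(i))  # (timestamp, value)
        count += 1
        if count == BUFFER_ROWS:
            flush(count)
            count = 0
    flush(count)  # write whatever is left in the final partial buffer
```

This keeps memory bounded (one buffer per channel) while paying the resize/write cost only once per N samples instead of once per sample.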