
Real-time data collection and 'offline' processing

I have a continuous stream of data. I want to do a small amount of processing to the data in real time (mostly just compression, rolling some data off the end, whatever needs doing) and then store the data. Presumably no problem. The HDF5 file format should do great! OOC data, no problem. PyTables.

Now the trouble. Occasionally, as a completely separate process so that data is still being gathered, I would like to perform a time-consuming calculation involving the data (on the order of minutes). This involves reading the same file I'm writing.

How do people do this?

Of course, reading a file that you're currently writing is bound to be challenging, but it seems that it must have come up often enough in the past that people have considered some sort of slick solution, or at least a natural workaround.

Partial solutions:

  1. It seems that HDF5-1.10.0 has a capability called SWMR (single-writer, multiple-reader). This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists I can't get Python to talk to the right version of HDF5. Any tips here would be welcomed. I'm using the Conda package manager. (A sketch of what this might look like appears after this list.)

  2. I could imagine writing to a buffer, which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this? (A sketch of this approach also appears after the list.)

This also seems like it might be computationally expensive, but perhaps there's no getting around that.

  3. Collect less data. What's the fun in that?
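For point 1, here is a minimal sketch of what SWMR can look like through h5py, assuming an h5py build linked against HDF5 1.10 or newer (recent conda-forge builds qualify). The file name, block size, and the random data standing in for real acquisition are all placeholders:

```python
import h5py
import numpy as np

# --- Writer process --------------------------------------------------------
# Create an extendable, chunked dataset and switch the file into SWMR mode.
f = h5py.File("stream.h5", "w", libver="latest")
dset = f.create_dataset("data", shape=(0,), maxshape=(None,),
                        dtype="f8", chunks=(4096,))
f.swmr_mode = True                       # readers may now open the file concurrently

for _ in range(100):                     # stands in for the acquisition loop
    block = np.random.random(4096)       # placeholder for a freshly acquired block
    n = dset.shape[0]
    dset.resize((n + block.size,))
    dset[n:] = block
    dset.flush()                         # make the new rows visible to SWMR readers

# --- Reader process (would run separately, while the writer keeps appending) ---
r = h5py.File("stream.h5", "r", libver="latest", swmr=True)
data = r["data"]
data.id.refresh()                        # pick up rows appended after the file was opened
print("rows visible to the reader:", data.shape[0])
r.close()
f.close()
```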
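For point 2, one way to avoid missing data is to push every acquired block onto a queue and let a separate writer thread drain it into the file in batches; nothing is lost because blocks sit in the queue until they have been written. A minimal sketch with PyTables, where the random blocks again stand in for the real acquisition loop:

```python
import queue
import threading
import numpy as np
import tables

buf = queue.Queue()          # every acquired block lands here first, so nothing is dropped

def acquire():
    """Stands in for the real-time acquisition loop (random data as a placeholder)."""
    for _ in range(100):
        buf.put(np.random.random(1024))
    buf.put(None)            # sentinel: acquisition finished

def write():
    """Drain the queue and append blocks to an extendable on-disk array."""
    with tables.open_file("buffered.h5", mode="w") as h5:
        arr = h5.create_earray(h5.root, "data",
                               atom=tables.Float64Atom(), shape=(0,))
        while True:
            block = buf.get()
            if block is None:
                break
            arr.append(block)
            if buf.empty():  # queue drained: a cheap moment to flush to disk
                h5.flush()

threading.Thread(target=acquire).start()
write()
```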

I suggest you take a look at adding Apache Kafka to your pipeline; it can act as a data buffer and help you separate the different tasks performed on the data you collect.

Pipeline example:

raw data ===> kafka topic (raw_data) ===> small processing ====> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file

At the same time you can have another process read the same data from the light_processing topic (or any other topic) and do your heavy processing there, and so on.

If the light processing and the heavy processing consumers connect to the Kafka topic with different group IDs, the data is delivered to both of them, and each process gets the same full stream (consumers sharing one group ID would instead split the messages between them).
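A minimal sketch of that point using the kafka-python client; the broker address, topic name, and group IDs are assumptions to adapt to your setup:

```python
from kafka import KafkaConsumer   # kafka-python package; assumes a broker on localhost:9092

def run_consumer(group_id):
    """Each processing stage uses its own consumer group, so each one
    independently receives the full stream of messages from the topic."""
    consumer = KafkaConsumer("light_processing",
                             bootstrap_servers="localhost:9092",
                             group_id=group_id,
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(group_id, "received", len(msg.value), "bytes")   # replace with real work

# Run each of these in its own process:
#   run_consumer("file_writer")        # light task: write to the db or HDF5 file
#   run_consumer("heavy_processing")   # heavy task: the minutes-long calculation
```

Because the broker retains messages (subject to the topic's retention settings), the heavy consumer can fall behind during a long calculation and catch up later without losing anything.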

Hope it helped.
