
Collecting, storing, and retrieving large amounts of numeric data

Original post: 2010-11-04 15:51:53 | Tags: java, c++, python, storage, simulation

I am about to start collecting large amounts of numeric data in real-time (for those interested, the bid/ask/last or 'tape' for various stocks and futures). The data will later be retrieved for analysis and simulation. That's not hard at all, but I would like to do it efficiently and that brings up a lot of questions. I don't need the best solution (and there are probably many 'bests' depending on the metric, anyway). I would just like a solution that a computer scientist would approve of. (Or not laugh at?)

(1) Optimize for disk space, I/O speed, or memory?

For simulation, the overall speed is important. We want the I/O (really, I) speed of the data just faster than the computational engine, so we are not I/O limited.

(2) Store text, or something else (binary numeric)?

(3) Given a set of choices from (1)-(2), are there any standout language/library combinations to do the job: Java, Python, C++, or something else?

I would classify this code as "write and forget", so more points for efficiency over clarity/compactness of code. I would very, very much like to stick with Python for the simulation code (because the sims do change a lot and need to be clear). So bonus points for good Pythonic solutions.

Edit: this is for a Linux system (Ubuntu)

Thanks

6 Answers

Optimizing for disk space and IO speed is the same thing - these days, CPUs are so fast compared to IO that it's often overall faster to compress data before storing it (you may actually want to do that). I don't really see memory playing a big role (though you should probably use a reasonably-sized buffer to ensure you're doing sequential writes).
Binary is more compact (and thus faster). Given the amount of data, I doubt whether being human-readable has any value. The only advantage of a text format would be that it's easier to figure out and correct if it gets corrupted or you lose the parsing code.
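
A minimal sketch of the two suggestions above (binary records, compressed before they hit the disk), using only the Python standard library. The record layout (timestamp, tick type, price, size) and the function names are assumptions made for illustration; they are not from the original post.

```python
import gzip
import struct

# Hypothetical fixed-size record: timestamp (float64), tick type (uint8),
# price (float64), size (uint32). "<" means little-endian with no padding.
RECORD = struct.Struct("<dBdI")

def write_ticks(path, ticks):
    """Pack each tick as a fixed-size binary record into a gzip stream."""
    with gzip.open(path, "wb") as f:
        for tick in ticks:
            f.write(RECORD.pack(*tick))

def read_ticks(path):
    """Yield ticks back as tuples, one fixed-size record at a time."""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            yield RECORD.unpack(chunk)
```

Whether compression is a net win depends on the data and the disk, so it is worth timing the simulation's read path with and without `gzip` before committing to either.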

Fame is an often-used commercial solution for time-series storage.

If you are serious about this, building your own will be a big job. HDF might be useful; they claim that it is suitable for tick data handling, and it has C++ access. There is Python support here.

Useful real-life experience from somebody with the same problem here, including HDF5 refs.

Actually, this is quite similar to what I'm doing, which is monitoring changes players make to the world in a game. I'm currently using an SQLite database with Python. At the start of the program, I load the disk database into memory for fast writes. Each change is put into two lists, one for the memory database and one for the disk database. Every x or so updates, the memory database is updated and a counter is incremented. This is repeated, and when the counter equals 5, it is reset, the list of changes for the disk is flushed to the disk database, and that list is cleared. I have found this works well if I also set the write mode (SQLite's journal mode) to WAL (Write-Ahead Logging). This method can sustain about 100-300 updates a second if I update memory every 100 updates and the disk counter is set to flush every 5 memory updates. You should probably choose binary, since, unless you have faults in your data sources, it would be the most logical choice.
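
A simplified sketch of the batching idea described above, using Python's built-in `sqlite3` module. It collapses the two-level scheme (memory database plus disk database) into a single in-memory list that is flushed to disk in one transaction. The class name, the table layout, and the default `flush_every=500` (100 updates times 5) are assumptions for illustration, not the answerer's actual code.

```python
import sqlite3

class BufferedTickWriter:
    """Buffer rows in memory and flush them to SQLite in one transaction."""

    def __init__(self, path, flush_every=500):
        self.conn = sqlite3.connect(path)
        # Write-ahead logging; only meaningful for file-backed databases.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS ticks (ts REAL, kind INTEGER, price REAL)"
        )
        self.conn.commit()
        self.flush_every = flush_every
        self.pending = []

    def add(self, ts, kind, price):
        """Queue one row; write the whole batch once it is large enough."""
        self.pending.append((ts, kind, price))
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.conn:  # commits the whole batch as a single transaction
            self.conn.executemany(
                "INSERT INTO ticks VALUES (?, ?, ?)", self.pending
            )
        self.pending.clear()

    def close(self):
        self.flush()
        self.conn.close()
```

Rows still sitting in `pending` are lost if the process dies before a flush, so the batch size is a trade-off between throughput and how much data you can afford to lose.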

Using D-Bus format to send the information may be to your advantage. The format is standard, binary, and D-Bus is implemented in multiple languages, and can be used to send both over the network and inter-process on the same machine.

If you are just storing, then use system tools. Don't write your own. If you need to do some real-time processing of the data before it is stored, then that's something completely different.

It just occurred to me after reading this thread on storing integers efficiently given certain conditions that we are wasting a lot of bits when we store tick data as doubles or floats or whatever. THE PRICES ARE QUANTIZED! And quite severely, at that. For example, yesterday's NQ range was from about 2175-2191, or about 26 points, quantized by 0.25. So that limits the ticks to ~100 different prices. See where I'm going with this? You only need one byte for each price. Stocks are quantized by 0.01 so you'd need ~ 1 byte for each dollar in the daily range.

So the method I'm outlining is: (1) store high price, low price, and increment as a one-line header; (2) store tick data after that as two bytes, with the two left-most bits used to encode the tick type (00 = last, 01 = bid, 11 = ask).
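
One possible reading of this scheme, sketched in Python. The post does not say how the remaining 14 bits are used; this sketch assumes they hold the price as a count of increments above the low price from the header. The function names are hypothetical, and timestamps and sizes are left out, as in the post.

```python
import struct

# Two-bit type codes as given in the post.
TICK_TYPES = {"last": 0b00, "bid": 0b01, "ask": 0b11}
TYPE_NAMES = {code: name for name, code in TICK_TYPES.items()}

def encode_tick(kind, price, low, increment):
    """Pack one tick into two bytes: 2 type bits, then a 14-bit price index."""
    index = round((price - low) / increment)
    if not 0 <= index < (1 << 14):
        raise ValueError("price outside the range the header can represent")
    return struct.pack(">H", (TICK_TYPES[kind] << 14) | index)

def decode_tick(data, low, increment):
    """Reverse encode_tick using the low price and increment from the header."""
    (word,) = struct.unpack(">H", data)
    return TYPE_NAMES[word >> 14], low + (word & 0x3FFF) * increment
```

For example, with a header of low = 2175.0 and increment = 0.25, `encode_tick("ask", 2180.25, 2175.0, 0.25)` stores index 21 under type code 11, and `decode_tick` on those two bytes returns `("ask", 2180.25)`.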

I think this is something a CS would approve of!