简体   繁体   中英

Suggestions on how to store and retrieve time-series data

I am currently working on a project that requires us to store a large amount of time series data, but more importantly, retrieve large amounts of it quick.

There will be N devices (>10,000) which will periodically send data to the system, lets say every 5 seconds. This data will quickly build up, but we are generally only interested in the most recent data, and want to compact the older data. We don't want to remove it, as it is still useful, but instead of having thousands of data point for a day, we might save just 5 or 10 after N days/weeks/months have passed.

Specifically we want to be able to fetch sampled data over a large time period, say a year or two. There might be millions of points here, but we just want a small, linearly distributed, sample of this data.

Today we are experimenting with influxdb, which initially seemed like an alright solution. It was fast enough and allows us to store our data in a reasonable structure, but we have found that it is not completely satisfactory. We were unable to perform the sample query described above and in general the system does not feel mature enough for us.

Any advice on how we can proceed, or alternative solutions, is much appreciated.

You might be interested in looking at TimescaleDB:

https://github.com/timescale/timescaledb

It builds a time-series DB on top of Postgres and so offers full SQL support, as well as generally the Postgres ecosystem/reliability. This can give you a lot greater query flexibility, which sounds like you want.

In terms of your specific use case, there would really be two solutions.

First, what people typically would do is to create two "hypertables", one for raw data, another for sampled data. These hypertables look like standard tables to the user, although heavily partitioned under the covers for much better scalability (eg, 20x insert throughput vs. postgres for large table sizes).

Then you basically do a roll-up from the raw to the sampled table, and use a different data retention policy on each (so you keep raw data for say 1 month, with sampled data for years).

http://docs.timescale.com/getting-started/setup/starting-from-scratch http://docs.timescale.com/api/data-retention

Second, you can go with a single hypertable, and then just schedule a normal SQL query to delete individual rows from data that's older than a certain time period.

We might even in the future add better first-class support for this latter approach if it becomes a common-enough requested feature, although most use cases we've encountered to date seemed more focused on #1, esp. in order to to keep statistical data about removed data-points, as opposed to just straight samples.

(Disclaimer: I'm one of the authors of TimescaleDB.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM