
How to handle collection and analysis of arbitrary timeseries data (data stream mining)

At our hackerspace, we have several environmental sensors and event trackers (such as the number of connected devices, heating, bar transactions, etc.) that output timeseries data at regular intervals. The output of our current platform consists of a unix timestamp plus a value/event. The intervals at which these are polled differ for each probe.

The goal is to collect this data in one dataset for

  1. efficient storage
  2. online analysis (using scikit)
  3. streaming visualization (using bokeh)
  4. handling both real-valued and discrete numeric data in an integrated manner
  5. (preferably using Python but this is not a requirement.)

What is a good practical approach to achieve the above goals? Are there existing libraries that provide this functionality?

The current (imperfect) plan:

  • Create a timeseries object per probe and combine them into a numpy array or a pandas timeseries DataFrame.
  • Resample the time axis to the smallest available interval, setting missing datapoints to NaN for sensors with a larger interval.
  • NaN values can later be interpolated/convolved.

However, this would result in a dataset where the majority of values are NaN, which brings its own statistical and possibly storage problems. Another option is to predetermine a median interval and resample to it, losing some data.
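In pandas, the plan above would look roughly like this. This is only a minimal sketch: the sensor names, intervals and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two probes polled at different intervals (10 s and 60 s) -- made-up data.
idx_temp = pd.date_range("2015-01-01", periods=360, freq="10s")
idx_devices = pd.date_range("2015-01-01", periods=60, freq="60s")

temp = pd.Series(20 + np.random.randn(len(idx_temp)), index=idx_temp, name="temperature")
devices = pd.Series(np.random.randint(0, 5, len(idx_devices)), index=idx_devices, name="devices")

# Outer join aligns both series on one time axis; the coarser sensor
# gets NaN at the timestamps it did not report.
df = pd.concat([temp, devices], axis=1)

# Resample to the smallest interval, then interpolate the gaps later.
df = df.resample("10s").mean()
df_filled = df.interpolate(method="time")

print(df.head())
print(df_filled.head())
```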

Time-series databases have turned out to be the right answer after some further searching. I plan on using OpenTSDB, as it seems the most mature of the available timeseries databases.
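For the collection side, each probe can push its readings to OpenTSDB's HTTP `/api/put` endpoint. A rough sketch of what that looks like (the host, metric name and tags are assumptions for illustration; OpenTSDB listens on port 4242 by default):

```python
import time
import requests

OPENTSDB_URL = "http://localhost:4242/api/put"  # adjust to your installation

def put_datapoint(metric, value, tags):
    """Send a single datapoint to OpenTSDB as a JSON object."""
    payload = {
        "metric": metric,
        "timestamp": int(time.time()),  # unix timestamp, as produced by the probes
        "value": value,
        "tags": tags,  # OpenTSDB requires at least one tag
    }
    resp = requests.post(OPENTSDB_URL, json=payload, timeout=5)
    resp.raise_for_status()  # /api/put returns 204 on success

put_datapoint("space.temperature", 21.4, {"sensor": "bar"})
```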

This solves the storage and interval-querying issues, as both are handled by the database management system itself. Then it is just a matter of visualization with Bokeh.
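For the visualization side, a Bokeh server app can periodically query OpenTSDB and stream new points into a live plot. Again a rough sketch only, with the metric, tags and host made up; run it with `bokeh serve --show stream_plot.py`:

```python
from datetime import datetime

import requests
from bokeh.plotting import figure, curdoc
from bokeh.models import ColumnDataSource

QUERY_URL = "http://localhost:4242/api/query"

source = ColumnDataSource(data=dict(time=[], value=[]))
fig = figure(x_axis_type="datetime", title="space.temperature")
fig.line(x="time", y="value", source=source)

def poll():
    # Ask OpenTSDB for the last minute of the metric, aggregated with `avg`.
    params = {"start": "1m-ago", "m": "avg:space.temperature{sensor=bar}"}
    results = requests.get(QUERY_URL, params=params, timeout=5).json()
    if not results:
        return
    dps = results[0].get("dps", {})  # {unix_timestamp: value, ...}
    timestamps = sorted(dps, key=int)
    new = {
        "time": [datetime.fromtimestamp(int(ts)) for ts in timestamps],
        "value": [dps[ts] for ts in timestamps],
    }
    # stream() appends the new points; rollover keeps the plot bounded.
    source.stream(new, rollover=1000)

curdoc().add_root(fig)
curdoc().add_periodic_callback(poll, 5000)  # poll every 5 seconds
```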
