
numpy time series data - vectorized fill gaps and calculate deltas

I've read data in from csv using genfromtxt and concatenated it with hstack, which results in an array of shape (5413260,) (loading takes about 17 min; the saved .npy file is ~1 GB).

The data is in the format:

timedelta64 1, temp1A, temp1B, temp1C, ...
timedelta64 2, temp2A, temp2B, temp2C, ...


>>> data[1:3]
array([ ('2009-01-01T18:41:00', 755, 855, 755, 855, 743, 843, 743, 843, 2),
       ('2009-01-01T18:43:45', 693, 793, 693, 793, 693, 793, 693, 793, 1)],
      dtype=[('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')])

I'd like to do deltas on the temps:

timedelta64 1, temp1A - temp1B, temp1B - temp1C, ...

and fills:

if timedelta64 2 - timedelta64 1 <= sample rate, no fill is needed; otherwise insert stub rows with the appropriate timestamps:

timedelta64 1 + shift, 0, 0, 0, CONSTANT, ...

I'm currently:

  1. iterating through numpy arrayA pairwise (arrayA[i], arrayA[i+1])
  2. calculating the deltas for row_i and appending them to numpy arrayB
  3. calculating the time difference between row_i+1 and row_i
  4. if there is a gap, repeatedly adding the shift to the timestamp, filling with zeros/the constant, and appending the stub rows to numpy arrayB

This is highly inefficient: it has taken over 12 hours so far, and I expect it will take 100+ days to complete.

What's the vectorized approach?

I'm thinking of a vectorized op to calculate the deltas first, but then I'm not sure how to quickly batch and insert the fills for the missing timestamps.

Also, is it faster to reshape -> diff -> fill or reshape -> fill -> diff?

Aside: this is pre-processing data for machine learning with TensorFlow; is there a better tool than numpy?

Since I'm using genfromtxt and heterogeneous dtypes, vectorized operations are done through the named columns; see: to slice columns in a tuple present in a numpy array. A rough sketch of the column-wise deltas follows below.
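A minimal sketch, assuming the dtype shown above; the delta field names and the two-row sample are made up for illustration, but the point is that the loop runs over a handful of column pairs rather than over millions of rows:

import numpy as np

dtype = [('datetime', '<M8[s]')] + [('sensor' + c, '<u4') for c in 'ABCDEFGH'] + [('signal', '<u4')]
data = np.array([('2009-01-01T18:41:00', 755, 855, 755, 855, 743, 843, 743, 843, 2),
                 ('2009-01-01T18:43:45', 693, 793, 693, 793, 693, 793, 693, 793, 1)],
                dtype=dtype)

sensors = ['sensor' + c for c in 'ABCDEFGH']
delta_dtype = [(a + '_minus_' + b, '<i8') for a, b in zip(sensors, sensors[1:])]
deltas = np.zeros(len(data), dtype=delta_dtype)
for a, b in zip(sensors, sensors[1:]):
    # cast the unsigned columns to signed first so negative deltas don't wrap around
    deltas[a + '_minus_' + b] = data[a].astype('<i8') - data[b].astype('<i8')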

Generating a range of numpy.datetime64 values: How can I make a python numpy arange of datetime
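For example, np.arange works directly with datetime64/timedelta64; the start, stop and 15-second step here are illustrative, since the real sample rate isn't shown above:

import numpy as np

start = np.datetime64('2009-01-01T18:41:00')
stop = np.datetime64('2009-01-01T19:00:00')
step = np.timedelta64(15, 's')            # assumed sample rate

stamps = np.arange(start, stop, step)     # datetime64[s] array, one entry per expected sample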

Concatenating large arrays in numpy is slow; it's best to use a pre-allocated array and fill it in using slices: How to add items into a numpy array
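A minimal sketch of that pattern; the reduced dtype, row counts and dummy chunk loop are stand-ins for the real csv loading code:

import numpy as np

row_dtype = [('datetime', '<M8[s]'), ('sensorA', '<u4'), ('signal', '<u4')]
n_rows = 1_000_000                         # total row count, known (or over-estimated) up front

out = np.zeros(n_rows, dtype=row_dtype)    # allocate the full array once

pos = 0
for chunk in (np.zeros(100_000, dtype=row_dtype) for _ in range(10)):
    # each chunk would normally come from genfromtxt on one csv file
    out[pos:pos + len(chunk)] = chunk      # fill by slice assignment instead of hstack
    pos += len(chunk)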

Then the two structured/record arrays are merged by matching on datetime64 and masking the appropriate fields, which is covered here: Compare two numpy arrays by first Column and create a third numpy array by concatenating two arrays
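One way to sketch that merge, using searchsorted on timestamps sorted in both arrays; the grid bounds, stub constant and reduced dtype are made up for illustration, and it assumes every observed timestamp falls exactly on a grid point:

import numpy as np

row_dtype = [('datetime', '<M8[s]'), ('sensorA', '<u4'), ('signal', '<u4')]

# regular grid of timestamps, pre-filled with the stub values
grid = np.arange(np.datetime64('2009-01-01T18:41:00'),
                 np.datetime64('2009-01-01T18:46:00'),
                 np.timedelta64(15, 's'))
filled = np.zeros(len(grid), dtype=row_dtype)
filled['datetime'] = grid
filled['signal'] = 9                       # stand-in for the CONSTANT stub value

# observed rows, e.g. the data loaded from csv
obs = np.zeros(2, dtype=row_dtype)
obs['datetime'] = np.array(['2009-01-01T18:41:00', '2009-01-01T18:43:45'], dtype='<M8[s]')
obs['sensorA'] = [755, 693]

# overwrite the stub rows wherever an observed timestamp lands on the grid
idx = np.searchsorted(filled['datetime'], obs['datetime'])
mask = filled['datetime'][idx] == obs['datetime']
filled[idx[mask]] = obs[mask]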

Overall, the speedup is roughly 100+ days => <5 min (28,800x faster). Pre-allocating the array should also speed up loading from csv.
