I've read some data in from csv using genfromtxt and hstack to concatenate the data, which results in an array of shape (5413260,) (it takes about 17 min and produces a ~1 GB .npy save file).
The data is in the format:
timedelta64 1, temp1A, temp1B, temp1C, ...
timedelta64 2, temp2A, temp2B, temp2C, ...
>>> data[1:3]
array([('2009-01-01T18:41:00', 755, 855, 755, 855, 743, 843, 743, 843, 2),
       ('2009-01-01T18:43:45', 693, 793, 693, 793, 693, 793, 693, 793, 1)],
      dtype=[('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')])
I'd like to do deltas on the temps:
timedelta64 1, temp1A - temp1B, temp1B - temp1C, ...
and fills:
where timedelta64 2 - timedelta64 1 <= sample rate, leave the rows as-is; otherwise, insert a stub row with the appropriate timestamp:
timedelta64 1 + shift, 0, 0, 0, CONSTANT, ...
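Both operations can be expressed as whole-array NumPy calls. Here is a minimal sketch on a toy version of the structured array above; the reduced field set and the 15 s sample rate are assumptions for illustration:

```python
import numpy as np

# Toy structured array shaped like the data shown earlier (fewer fields).
dtype = [('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4'),
         ('sensorC', '<u4'), ('signal', '<u4')]
data = np.array([('2009-01-01T18:41:00', 755, 855, 755, 2),
                 ('2009-01-01T18:41:15', 693, 793, 693, 1),
                 ('2009-01-01T18:42:00', 700, 800, 700, 1)], dtype=dtype)

# Deltas between adjacent sensor columns: one vectorized subtraction each.
# Cast to a signed type first so the <u4 fields can't wrap on underflow.
d_ab = data['sensorA'].astype('i8') - data['sensorB'].astype('i8')
d_bc = data['sensorB'].astype('i8') - data['sensorC'].astype('i8')

# Gap detection: one np.diff over the timestamps vs. the sample rate.
sample_rate = np.timedelta64(15, 's')
gaps = np.diff(data['datetime']) > sample_rate  # True where a fill is needed
```

`gaps` marks the positions where stub rows must be inserted; the number of stubs per gap follows from dividing each oversized diff by the sample rate.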
My current row-by-row approach is highly inefficient: it's taken over 12 hours so far and I expect it will take 100+ days to complete.
What's the vectorized approach?
I'm thinking a vector op to calculate the deltas first, then I'm not sure how to quickly batch and insert the fills for the missing timestamps.
Also, is it faster to reshape -> diff -> fill or reshape -> fill -> diff?
Aside: this is for pre-processing data for machine learning with TensorFlow; is there a better tool than numpy?
Since I'm using genfromtxt and heterogeneous dtypes, vectorized operations are accomplished through named columns, as described in: to slice columns in a tuple present in a numpy array
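A named column of a structured array is an ordinary ndarray view, so ufuncs and slicing apply to it directly. A small sketch (field names here are illustrative):

```python
import numpy as np

# Toy structured array with the same layout as the question's data.
dtype = [('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4')]
data = np.array([('2009-01-01T18:41:00', 755, 855),
                 ('2009-01-01T18:43:45', 693, 793)], dtype=dtype)

# A named column is a plain ndarray, so vectorized math works on it directly.
shifted = data['sensorA'] + 10

# Several columns can be gathered into a homogeneous 2-D array for bulk math.
temps = np.stack([data[name].astype('i8') for name in ('sensorA', 'sensorB')],
                 axis=1)
```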
Generating a range of numpy.datetime64 values is covered in: How can I make a python numpy arange of datetime
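np.arange accepts datetime64 endpoints with a timedelta64 step, which is how the stub timestamps for each gap can be generated in one call (the 15 s step is an assumption):

```python
import numpy as np

# End-exclusive range of timestamps, like a regular arange.
start = np.datetime64('2009-01-01T18:41:00')
stop = np.datetime64('2009-01-01T18:42:00')
step = np.timedelta64(15, 's')
stamps = np.arange(start, stop, step)
```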
Concatenating large arrays in numpy is slow; it's best to use a pre-allocated array and fill it in using slices: How to add items into a numpy array
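Repeated hstack reallocates and copies the whole result on every call, so the cost grows quadratically. Pre-allocating once and filling with slice assignments copies each chunk exactly once; a sketch with stand-in chunks:

```python
import numpy as np

# Stand-ins for the per-file arrays read from csv.
chunks = [np.arange(5), np.arange(5, 12), np.arange(12, 20)]

# Allocate the full output once, then fill it with slice assignments.
total = sum(len(c) for c in chunks)
out = np.empty(total, dtype=chunks[0].dtype)

pos = 0
for c in chunks:
    out[pos:pos + len(c)] = c  # O(len(c)) copy into the pre-allocated buffer
    pos += len(c)
```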
Then I needed to merge two structured/record arrays based on matching datetime64 values and mask the appropriate fields, which is covered here: Compare two numpy arrays by first Column and create a third numpy array by concatenating two arrays
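One way to interleave the stub rows without per-row insertion is a single concatenate followed by one argsort on the datetime field. A sketch with a reduced dtype (field names and the fill value are illustrative):

```python
import numpy as np

dtype = [('datetime', '<M8[s]'), ('sensorA', '<u4')]
real = np.array([('2009-01-01T18:41:00', 755),
                 ('2009-01-01T18:42:00', 700)], dtype=dtype)

# Stub rows for the missing timestamps, with a constant fill value.
stubs = np.array([('2009-01-01T18:41:30', 0)], dtype=dtype)

# Concatenate once, then one argsort on the datetime field puts the
# stubs in place -- no per-row insertion.
merged = np.concatenate([real, stubs])
merged = merged[np.argsort(merged['datetime'])]
```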
Overall speedup looks like 100+ days => <5 min (28,800x faster). The pre-allocated array should also speed up loading from csv.