Efficient Way to Create Numpy Arrays from Binary Files
I have very large datasets stored in binary files on the hard disk. Here is an example of the file structure:
File Header
149 Byte ASCII Header

Record Start
4 Byte Int - Record Timestamp

Sample Start
2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample
Sample End
Each record has 122,880 samples and there are 713 records per file. Each sample is 4 streams × 2 bytes = 8 bytes, so a record is 4 + 122,880 × 8 = 983,044 bytes and a file totals 149 + 713 × 983,044 = 700,910,521 bytes. The sample rate and number of records sometimes vary, so the code has to detect both for each file.
The code I currently use to import this data into arrays is as follows:
from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
    input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):
    time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
    unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] )

    record_t = zeros(sample_rate , dtype=int16)
    record_x = zeros(sample_rate , dtype=int16)
    record_y = zeros(sample_rate , dtype=int16)
    record_z = zeros(sample_rate , dtype=int16)

    for sample in xrange(sample_rate):
        record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
        record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
        record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
        record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

    time_series = hstack ( ( time_series , time_stamp ) )
    t_series = hstack ( ( t_series , record_t ) )
    x_series = hstack ( ( x_series , record_x ) )
    y_series = hstack ( ( y_series , record_y ) )
    z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'
This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way I could do this?
Using numpy's fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.
from numpy import savez, dtype , fromfile
from os.path import getsize
from time import clock
start_time = clock()
file_size = getsize(input_file)
openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2
record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )
data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)
end_time = clock()
print 'It took',end_time - start_time,'seconds'
A few tips:
Don't use the struct module. Instead, use Numpy's structured data types and fromfile. See this example: http://scipy-lectures.github.com/advanced/advanced_numpy/index.html#example-reading-wav-files
You can read all of the records at once by passing a suitable count= to fromfile.
Something like this (untested, but you get the idea):
import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'),
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
Numpy supports mapping binary data from a file directly into array-like objects via numpy.memmap. You may be able to memmap the file and extract the data you need via offsets.
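For illustration, a minimal sketch of the memmap route (not from the original answer, untested), reusing input_file, sample_rate, and number_of_records from the code above:

import numpy as np

# Map the records lazily instead of reading the whole file into memory;
# offset=149 skips the ASCII header. Nothing is read from disk until a
# slice is actually accessed.
record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))
time_series = data['timestamp']
t_series = data['samples'][:, :, 0].ravel()  # ravel() copies into RAM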
For byte-order correctness, just use numpy's byteswap on what you have read in. You can use a conditional expression to check the endianness of the host system:
import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
    # Host is big-endian; swap the array you just read, in place
    arrayName.byteswap(True)
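As a side note (my addition, not from this answer): if the file's byte order is known up front, it can be pinned in the dtype itself, as the '<i4'/'<i2' codes above already do, and then no explicit swap step is needed. A tiny self-contained illustration:

import numpy as np

buf = b'\x01\x00\x02\x00'             # two little-endian int16 values
a = np.frombuffer(buf, dtype='<i2')   # [1, 2] on any host
b = np.frombuffer(buf, dtype='>i2')   # [256, 512]: wrong order assumed
assert (a == b.byteswap()).all()      # byteswap recovers the values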
One obvious inefficiency is the use of hstack in a loop:
time_series = hstack ( ( time_series , time_stamp ) )
t_series = hstack ( ( t_series , record_t ) )
x_series = hstack ( ( x_series , record_x ) )
y_series = hstack ( ( y_series , record_y ) )
z_series = hstack ( ( z_series , record_z ) )
On each iteration this allocates a slightly bigger array for every series and copies all the data read so far into it. That involves a lot of unnecessary copying and can potentially lead to bad memory fragmentation.
I would accumulate the time_stamp values in a list and do a single hstack at the end, and would do exactly the same for record_t and the others; see the sketch below.
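A minimal sketch of that change (my rewrite of the question's loop, untested); the unpacking itself stays exactly as in the question:

time_chunks = []
t_chunks = []   # and likewise for x, y and z

for record in xrange(number_of_records):
    # ... unpack time_stamp and record_t exactly as in the question ...
    time_chunks.append(time_stamp)
    t_chunks.append(record_t)

# One allocation and copy per series instead of one per iteration.
time_series = hstack(time_chunks)
t_series = hstack(t_chunks)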
If that didn't bring sufficient performance improvements, I would comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.
Using array and struct.unpack, I got satisfactory results on a similar problem (multi-resolution, multi-channel binary data files). In my problem I wanted continuous data for each channel, but the file had an interval-oriented structure rather than a channel-oriented one.
The "secret" is to read the whole file first and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):
import os
from array import array

f = open(somefilename, 'rb')
fullsamples = array('h')
# Read the rest of the file (after any header already consumed) as int16
fullsamples.fromfile(f, (os.path.getsize(somefilename) - f.tell()) // 2)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position + samples])
        position += samples
Of course, I cannot say this is better or faster than the other answers provided, but at least it is something you could evaluate.
Hope it helps!