
Efficient Way to Create Numpy Arrays from Binary Files

I have very large datasets stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

Each record has 122,880 samples and there are 713 records per file. This yields a total size of 700,910,521 bytes. The sample rate and number of records sometimes vary, so I have to write code that detects both for each file.
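(That total is consistent with the layout above: 149 + 713 × (4 + 122,880 × 4 × 2) = 700,910,521 bytes.)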

The code I currently use to import this data into arrays is as follows:

from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
  input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

  time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
  unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] ) 

  record_t = zeros(sample_rate , dtype=int16)
  record_x = zeros(sample_rate , dtype=int16)
  record_y = zeros(sample_rate , dtype=int16)
  record_z = zeros(sample_rate , dtype=int16)

  for sample in xrange(sample_rate):

    record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
    record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
    record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
    record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way to do this?

Final Solution

Using numpy's fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.

from numpy import savez, dtype , fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

# One record: a 4-byte little-endian timestamp followed by
# sample_rate rows of 4 little-endian int16 samples
record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )

data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)

end_time = clock()

print 'It took',end_time - start_time,'seconds'

Some hints:

Something like this (untested, but you get the idea):

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'), 
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

Numpy supports mapping binary data from a file directly into array-like objects via numpy.memmap. You may be able to memmap the file and extract the data you need via offsets.
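A minimal sketch of that approach (untested), reusing input_file, sample_rate and number_of_records as parsed from the header above; the array is backed by the file on disk and only the slices you actually touch are read into memory:

import numpy as np

record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])

# Map the records lazily instead of reading the whole file up front,
# skipping the 149-byte ASCII header via offset
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))

time_series = np.array(data['timestamp'])    # copy out just the timestamps
t_series = data['samples'][:, :, 0].ravel()  # ravel() copies this column into memory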

For byte-order correctness, just use numpy.byteswap on what you have read in. You can use a conditional expression to check the endianness of the host system:

import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
  # Host is big-endian; swap the little-endian file data in place
  arrayName.byteswap(True)

One obvious inefficiency is the use of hstack in a loop:

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

On every iteration this allocates a slightly bigger array for each series and copies all the data read so far into it. This involves a lot of unnecessary copying and can potentially lead to bad memory fragmentation.

I would accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t and the other series.
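For example, something along these lines (sketch only, shown for time_stamp and record_t; x, y and z would be handled the same way):

time_chunks = []
t_chunks = []

for record in xrange(number_of_records):
  # ... unpack time_stamp and record_t exactly as before ...
  time_chunks.append(time_stamp)
  t_chunks.append(record_t)

# One allocation and copy per series instead of one per record
time_series = hstack(time_chunks)
t_series = hstack(t_chunks)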

If that doesn't bring sufficient performance improvements, I would comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.

I got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files) by using array and struct.unpack. In my problem I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented one.

The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):

import os
from array import array

f = open(somefilename, 'rb')
fullsamples = array('h')
# Read everything remaining in the file as 2-byte integers in one go
fullsamples.fromfile(f, os.path.getsize(somefilename)/2 - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position+samples])
        position += samples

Of course, I cannot say whether this is better or faster than the other answers provided, but it is at least something you might evaluate.

Hope it helps!
