
Efficient Way to Create Numpy Arrays from Binary Files

I have very large datasets stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

Each record has 122,880 samples and there are 713 records per file. This yields a total size of 700,910,521 bytes. The sample rate and number of records sometimes vary, so I have to write code that detects both for each file.
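(That total is consistent with the layout above: 149 + 713 × (4 + 122,880 × 4 × 2) = 700,910,521 bytes.)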

The code I currently use to import this data into arrays is as follows:

from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
  input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

  time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
  unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] ) 

  record_t = zeros(sample_rate , dtype=int16)
  record_x = zeros(sample_rate , dtype=int16)
  record_y = zeros(sample_rate , dtype=int16)
  record_z = zeros(sample_rate , dtype=int16)

  for sample in xrange(sample_rate):

    record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
    record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
    record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
    record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

This currently takes about 250 seconds per 700 MB file, which seems very high to me. Is there a more efficient way to do this?

Final Solution

Using numpy's fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.

from numpy import savez, dtype , fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

# One record: a 4-byte little-endian timestamp followed by
# sample_rate rows of 4 little-endian int16 samples
record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )

data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)

end_time = clock()

print 'It took',end_time - start_time,'seconds'

Some hints:

Something like this (untested, but you get the idea):

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'), 
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

Numpy supports mapping binary data from a file directly into array-like objects via numpy.memmap. You may be able to memmap the file and extract the data you need via offsets.
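A minimal sketch of that approach (untested), reusing input_file, sample_rate and number_of_records as parsed from the header above; the array is backed by the file on disk and only the slices you actually touch are read into memory:

import numpy as np

record_dtype = np.dtype([('timestamp', '<i4'),
                         ('samples', '<i2', (sample_rate, 4))])

# Map the records lazily instead of reading the whole file up front,
# skipping the 149-byte ASCII header via offset
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))

time_series = np.array(data['timestamp'])    # copy out just the timestamps
t_series = data['samples'][:, :, 0].ravel()  # ravel() copies this column into memory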

For byte-order correctness, just use numpy.byteswap on what you have read in. You can use a conditional expression to check the endianness of the host system:

import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
  # Host is big-endian; swap the little-endian file data in place
  arrayName.byteswap(True)

One obvious inefficiency is the use of hstack in a loop:

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

On every iteration this allocates a slightly bigger array for each series and copies all the data read so far into it. This involves a lot of unnecessary copying and can potentially lead to bad memory fragmentation.

I would accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t and the other series.
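For example, something along these lines (sketch only, shown for time_stamp and record_t; x, y and z would be handled the same way):

time_chunks = []
t_chunks = []

for record in xrange(number_of_records):
  # ... unpack time_stamp and record_t exactly as before ...
  time_chunks.append(time_stamp)
  t_chunks.append(record_t)

# One allocation and copy per series instead of one per record
time_series = hstack(time_chunks)
t_series = hstack(t_chunks)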

If that doesn't bring sufficient performance improvements, I would comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.

I got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files) by using array and struct.unpack. In my problem I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented one.

The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (in the code below, self.channel_content[channel]['recording'] is an object of type array):

import os
from array import array

f = open(somefilename, 'rb')
fullsamples = array('h')
# Read everything remaining in the file as 2-byte integers in one go
fullsamples.fromfile(f, os.path.getsize(somefilename)/2 - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position+samples])
        position += samples

Of course, I cannot say whether this is better or faster than the other answers provided, but it is at least something you might evaluate.

Hope it helps!
