Extracting specific bytes from a binary file in Python

Question

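I have very large binary files containing x int16 data points for each of y sensors, plus a header with some basic information. The binary file is written as the y values for each sample time, up to x samples, then another set of readings, and so on. If I want all of the data, I use numpy.fromfile(), which works very well and is fast. However, if I only want a subset of the sensor data, or only specific sensors, I currently have a horrible double for loop using file.seek(), file.read(), and struct.unpack() that takes forever. Is there another way to do this faster in Python? Perhaps with mmap(), which I don't quite understand? Or just read the whole thing with fromfile() and then subsample?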

# One row per selected sensor, one column per sample.
data = numpy.empty((len(sensor_indices), num_pts))
for i in range(num_pts):
    for j in range(len(sensor_indices)):
        curr_file.seek(bin_offsets[j])
        data_binary = curr_file.read(2)
        data[j, i] = struct.unpack('h', data_binary)[0]

Following @rrauenza's suggestion about mmap, which was great information, I edited the code to:

mm = mmap.mmap(curr_file.fileno(), 0, access=mmap.ACCESS_READ)
data = numpy.empty((len(sensor_indices), num_pts))
for i in range(num_pts):
    for j in range(len(sensor_indices)):
        offset += bin_offsets[j] * 2
        data[j, i] = struct.unpack('h', mm[offset:offset+2])[0]

Although this is faster than before, it is still orders of magnitude slower than:

shape = (x, y)
data = np.fromfile(file=self.curr_file, dtype=np.int16).reshape(shape)
data = data.transpose()
data = data[sensor_indices, :]
data = data[:, range(num_pts)]

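I tested on a smaller 30 MB file with only 16 sensors and 30 seconds of data. The original code took 160 seconds, mmap took 105 seconds, and np.fromfile with subsampling took 0.33 seconds.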

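The remaining question: numpy.fromfile() is clearly better for small files, but will it be a problem with much larger files, possibly 20 GB, with hours or days of data and up to 500 sensors?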

Answer 1

I would definitely try mmap():

https://docs.python.org/2/library/mmap.html

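If you are calling seek() and read() for every int16 you extract, you are doing a lot of small reads, and those incur a lot of system call overhead.

I wrote a small test to demonstrate: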

#!/usr/bin/python

import mmap
import os
import struct
import sys

FILE = "/opt/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10


def byfile():
    total = 0
    with open(FILE, "rb") as fd:  # binary mode: the file holds raw int16 values
        for offset in range(0, SIZE // BYTES, SKIP*BYTES):
            fd.seek(offset)  # one seek and one read per value
            data = fd.read(BYTES)
            total += struct.unpack('h', data)[0]
    return total


def bymmap():
    total = 0
    with open(FILE, "rb") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)  # prot= is Unix-only
        for offset in range(0, SIZE // BYTES, SKIP*BYTES):
            data = mm[offset:offset+BYTES]  # slice the mapping, no read() call
            total += struct.unpack('h', data)[0]
    return total


if sys.argv[1] == 'mmap':
    print(bymmap())

if sys.argv[1] == 'file':
    print(byfile())

I ran each method twice to compensate for caching. I used time because I wanted to measure user and sys time.

The results:

[centos7:/tmp]$ time ./test.py file
-211990391

real    0m44.656s
user    0m35.978s
sys     0m8.697s
[centos7:/tmp]$ time ./test.py file
-211990391

real    0m43.091s
user    0m37.571s
sys     0m5.539s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real    0m16.712s
user    0m15.495s
sys     0m1.227s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real    0m16.942s
user    0m15.846s
sys     0m1.104s
[centos7:/tmp]$

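(The sum -211990391 only verifies that both versions do the same thing.)

Looking at the second result for each version, mmap() takes about 1/3 of the real time. User time is about 1/2 and system time is about 1/5.

Your other options for possibly speeding this up are: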

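(1) As you mentioned, load the whole file. Large I/Os instead of small I/Os could speed things up. If you exceed system memory, though, you will fall back to paging, which will be worse than mmap() (since you have to page out). I am not very hopeful here, because mmap is already using larger I/Os.

(2) Concurrency. Reading the file in parallel through multiple threads might speed things up, but you will have the Python GIL to deal with. Multiprocessing will work better because it avoids the GIL, and you can easily pass your data back to a top-level handler. This will, however, work against the next item, locality: you might make your I/O more random.

(3) Locality. Somehow organize your data (or order your reads) so that the data you need is closer together. mmap() pages the file in chunks according to the system page size: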

>>> import mmap
>>> mmap.PAGESIZE
4096
>>> mmap.ALLOCATIONGRANULARITY
4096
>>>

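If your data is closer together (within the same 4k chunk), it will already have been loaded into the buffer cache.

(4) Better hardware. Like an SSD.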
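Options (1) and (3) can be combined for files that do not fit in memory: read a large sequential block, keep only the wanted sensor columns, and move on to the next block. The following is a minimal sketch, not the answer's own method. The function name, the `header_bytes` and `samples_per_chunk` parameters, and the time-major int16 layout (taken from the question's description) are all assumptions.

```python
import numpy as np


def read_sensors_chunked(path, n_sensors, sensor_indices, header_bytes=0,
                         samples_per_chunk=1_000_000):
    """Collect selected sensor columns while holding only one chunk in memory.

    Hypothetical helper: assumes a header of header_bytes bytes followed by
    int16 values stored as all sensors for sample 0, then sample 1, and so on.
    """
    pieces = []
    with open(path, "rb") as f:
        f.seek(header_bytes)  # skip the header; its size is an assumption
        while True:
            # One large sequential read per chunk instead of one read per value.
            block = np.fromfile(f, dtype=np.int16,
                                count=samples_per_chunk * n_sensors)
            if block.size < n_sensors:
                break  # end of file, or only a partial trailing sample
            n_samples = block.size // n_sensors
            rows = block[: n_samples * n_sensors].reshape(n_samples, n_sensors)
            # Copy the wanted columns so the full chunk can be freed.
            pieces.append(rows[:, sensor_indices].copy())
    if not pieces:
        return np.empty((len(sensor_indices), 0), dtype=np.int16)
    # Result has one row per selected sensor, one column per sample.
    return np.concatenate(pieces).T
```

Peak memory is roughly one chunk plus the selected columns, so `samples_per_chunk` can be tuned to the machine rather than to the file size.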

I did run this on an SSD, and it was much faster. I then profiled the Python code, wondering whether the unpack was expensive. It is not:

$ python -m cProfile test.py mmap
121679286
         26843553 function calls in 8.369 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    6.204    6.204    8.357    8.357 test.py:24(bymmap)
        1    0.012    0.012    8.369    8.369 test.py:3(<module>)
 26843546    1.700    0.000    1.700    0.000 {_struct.unpack}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.stat}
        1    0.453    0.453    0.453    0.453 {range}
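In the profile above, most of the time is attributed to the Python loop in bymmap itself rather than to struct.unpack. One way to avoid a per-value loop while still mapping the file is numpy.memmap, which exposes the mapped bytes as an array that can be indexed in bulk. This is a sketch under assumptions, not something the answer measured: the function name, `header_bytes`, and the time-major int16 layout from the question are assumed.

```python
import numpy as np


def read_sensors_memmap(path, n_sensors, sensor_indices, header_bytes=0,
                        num_pts=None):
    """Return an array of shape (len(sensor_indices), num_pts).

    Hypothetical helper: assumes a header of header_bytes bytes followed by
    native-endian int16 values, all sensors for sample 0, then sample 1, etc.
    """
    # Map the file instead of reading it; pages are loaded only when touched.
    flat = np.memmap(path, dtype=np.int16, mode="r", offset=header_bytes)
    n_samples = flat.size // n_sensors
    table = flat[: n_samples * n_sensors].reshape(n_samples, n_sensors)
    if num_pts is None:
        num_pts = n_samples
    # Indexing with a list copies only the requested columns into memory.
    return np.asarray(table[:num_pts, sensor_indices]).T
```

Because the layout is time-major, each selected sensor still touches every page in the requested sample range, so the locality caveat in item (3) applies here as well.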

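Addendum:

Curiosity got the best of me and I tried out multiprocessing. I need to look at my partitioning more closely, but the number of unpacks (53687092) is the same across trials: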

$ time ./test2.py 4
[(4415068.0, 13421773), (-145566705.0, 13421773), (14296671.0, 13421773), (109804332.0, 13421773)]
(-17050634.0, 53687092)

real    0m5.629s
user    0m17.756s
sys     0m0.066s
$ time ./test2.py 1
[(264140374.0, 53687092)]
(264140374.0, 53687092)

real    0m13.246s
user    0m13.175s
sys     0m0.060s

Code:

#!/usr/bin/python

import functools
import multiprocessing
import mmap
import os
import struct
import sys

FILE = "/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10


def bymmap(poolsize, n):
    # Worker n handles the n-th of poolsize equal byte ranges of the file.
    partition = SIZE // poolsize
    initial = n * partition
    end = initial + partition
    total = 0.0
    unpacks = 0
    with open(FILE, "rb") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)  # prot= is Unix-only
        for offset in range(initial, end, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            total += struct.unpack('h', data)[0]
            unpacks += 1
    return (total, unpacks)


if __name__ == "__main__":
    poolsize = int(sys.argv[1])
    pool = multiprocessing.Pool(poolsize)
    results = pool.map(functools.partial(bymmap, poolsize), range(0, poolsize))
    print(results)  # per-worker (sum, unpack count)
    print(functools.reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), results))
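The 4-process and 1-process runs above report the same number of unpacks but different sums. That is consistent with partition boundaries that are not multiples of the stride SKIP*BYTES, so the workers read different offsets than a single pass would; this is an inference from the output, not something the answer states. A minimal sketch of stride-aligned partitioning follows; `aligned_partitions` is a hypothetical helper, not part of the original script.

```python
def aligned_partitions(size, poolsize, step):
    """Split range(0, size, step) into poolsize contiguous (start, end) pieces.

    Every non-empty piece starts on a multiple of step, so the workers
    together visit exactly the offsets of one pass over range(0, size, step).
    """
    n_steps = -(-size // step)            # ceil(size / step): total reads
    per_worker = -(-n_steps // poolsize)  # ceil: reads handed to each worker
    bounds = []
    for n in range(poolsize):
        start = min(n * per_worker * step, size)
        end = min((n + 1) * per_worker * step, size)
        bounds.append((start, end))
    return bounds
```

In the script above, a worker could then iterate over `range(start, end, SKIP*BYTES)` using the n-th pair from `aligned_partitions(SIZE, poolsize, SKIP*BYTES)` in place of `initial` and `end`.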


Solution 1
5, accepted, 2016-06-03 21:39:22
