
What is the fastest way to sort and unpack a large bytearray?

I have a large binary file that needs to be converted into the HDF5 file format.

I am using Python 3.6. My idea is to read in the file, sort the relevant information, unpack it and store it away. My information is stored in such a way that the 8-byte time is followed by 2 bytes of energy and then 2 bytes of extra information, then time again, and so on. My current way of doing it is the following (my information is read as a bytearray, with the name byte_array):

import numpy as np

# walk the buffer in 12-byte records: 8 bytes of time, 2 bytes of energy, 2 bytes of extras
for i in range(0, len(byte_array)+1, 12):

    if i == 0:
        timestamp_bytes = byte_array[i:i+8]
        energy_bytes = byte_array[i+8:i+10]
        extras_bytes = byte_array[i+10:i+12]
    else:
        timestamp_bytes += byte_array[i:i+8]
        energy_bytes += byte_array[i+8:i+10]
        extras_bytes += byte_array[i+10:i+12]


# reinterpret the accumulated little-endian bytes as numpy arrays
timestamp_array = np.ndarray((len(timestamp_bytes)//8,), '<Q', timestamp_bytes)
energy_array = np.ndarray((len(energy_bytes) // 2,), '<h', energy_bytes)
extras_array = np.ndarray((len(timestamp_bytes) // 8,), '<H', extras_bytes)

I assume there is a much faster way of doing this, maybe by avoiding looping over the whole thing. My files are up to 15 GB in size, so every bit of improvement would help a lot.

You should be able to just tell NumPy to interpret the data as a structured array and extract fields:

as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
                              dtype='<Q, <h, <H',
                              buffer=byte_array)
timestamps = as_structured['f0']
energies = as_structured['f1']
extras = as_structured['f2']

This will produce three arrays backed by the input bytearray. Creating these arrays should be effectively instant, but I can't guarantee that working with them will be fast - I think NumPy may need to do some implicit copying to handle alignment issues with these arrays. It's possible (I don't know) that explicitly copying them yourself with .copy() first might speed things up.
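A minimal sketch of that explicit-copy idea, reusing byte_array and NumPy's default field names f0/f1/f2 from above (whether the copies actually help would need to be measured on real data):

as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
                              dtype='<Q, <h, <H',
                              buffer=byte_array)

# .copy() materialises each field as its own contiguous, aligned array,
# so later arithmetic no longer has to step through the packed 12-byte records
timestamps = as_structured['f0'].copy()
energies = as_structured['f1'].copy()
extras = as_structured['f2'].copy()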

You can use numpy.frombuffer with a custom datatype:

import struct
import random

import numpy as np


data = [
    (random.randint(0, 255**8), random.randint(0, 255*255), random.randint(0, 255*255))
    for _ in range(20)
    ]

Bytes = b''.join(struct.pack('<Q2H', *row) for row in data)
dtype = np.dtype([('time', np.uint64), 
                  ('energy', np.uint16), # you may need to change that to `np.int16`, if energy can be negative
                  ('extras', np.uint16)])

original = np.array(data, dtype=np.uint64)
result = np.frombuffer(Bytes, dtype)

print((result['time'] == original[:, 0]).all())
print((result['energy'] == original[:, 1]).all())
print((result['extras'] == original[:, 2]).all())

print(result)

Example output:

True
True
True
[(6048800706604665320, 52635, 291) (8427097887613035313, 15520, 4976)
 (3250665110135380002, 44078, 63748) (17867295175506485743, 53323, 293)
 (7840430102298790024, 38161, 27601) (15927595121394361471, 47152, 40296)
 (8882783920163363834, 3480, 46666) (15102082728995819558, 25348, 3492)
 (14964201209703818097, 60557, 4445) (11285466269736808083, 64496, 52086)
 (6776526382025956941, 63096, 57267) (5265981349217761773, 19503, 32500)
 (16839331389597634577, 49067, 46000) (16893396755393998689, 31922, 14228)
 (15428810261434211689, 32003, 61458) (5502680334984414629, 59013, 42330)
 (6325789410021178213, 25515, 49850) (6328332306678721373, 59019, 64106)
 (3222979511295721944, 26445, 37703) (4490370317582410310, 52413, 25364)]
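
For an on-disk file of this size, the same dtype can also be handed to numpy.fromfile so the records are read straight from the file instead of building the bytes in memory first; a small sketch, assuming the input file is called data.bin (a made-up name):

import numpy as np

dtype = np.dtype([('time', np.uint64),
                  ('energy', np.uint16),
                  ('extras', np.uint16)])

# reads the whole file as structured records in one call
# (still needs enough RAM to hold the result)
result = np.fromfile('data.bin', dtype=dtype)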

I'm not an expert on numpy, but here are my 5 cents: you have lots of data, probably more than your RAM. This points to the simplest solution - don't try to fit all the data into your program at once. When you read a file into a variable, those X GB are read into RAM. If that is more than the available RAM, your OS starts swapping. Swapping slows you down, since on top of the disk operations for reading the source file, you now also have writes to disk to dump RAM contents into the swap file. Instead of that, write the script so that it uses parts of the input file only as necessary (in your case you read through the file sequentially anyway and don't go back or jump far ahead).

Try opening the input file as a memory-mapped data structure (please note the differences in usage between Unix and Windows environments).

Then you can simply read([n]) a chunk of bytes at a time and append that to your arrays. Behind the scenes, data is read into RAM page by page as needed and will not exceed the available memory, leaving more space for your arrays to grow.
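A sketch of that chunked, memory-mapped reading, assuming the input file is called data.bin (a made-up name) and the 12-byte record layout from the question; the mmap call uses the access keyword, which works on both Unix and Windows:

import mmap

import numpy as np

record = np.dtype([('time', '<u8'), ('energy', '<i2'), ('extras', '<u2')])
chunk_bytes = 1_000_000 * record.itemsize   # ~12 MB of records per chunk

with open('data.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        while True:
            buf = mm.read(chunk_bytes)      # pages are pulled in lazily by the OS
            if not buf:
                break
            chunk = np.frombuffer(buf, dtype=record)
            # ... process chunk['time'], chunk['energy'], chunk['extras'] here ...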

Also consider the fact that your resultant arrays can also outgrow RAM, which will cause a slowdown similar to reading a big file.
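Since the end goal is an HDF5 file, one way to keep the results from piling up in RAM is to write each processed chunk straight into preallocated HDF5 datasets; a sketch using h5py, with made-up file and dataset names:

import os

import h5py
import numpy as np

record = np.dtype([('time', '<u8'), ('energy', '<i2'), ('extras', '<u2')])
n_records = os.path.getsize('data.bin') // record.itemsize

with h5py.File('output.h5', 'w') as out:
    d_time   = out.create_dataset('time',   shape=(n_records,), dtype='<u8')
    d_energy = out.create_dataset('energy', shape=(n_records,), dtype='<i2')
    d_extras = out.create_dataset('extras', shape=(n_records,), dtype='<u2')

    def write_chunk(start, chunk):
        # copy one structured chunk (e.g. from the loop above) into the output datasets
        stop = start + chunk.shape[0]
        d_time[start:stop]   = chunk['time']
        d_energy[start:stop] = chunk['energy']
        d_extras[start:stop] = chunk['extras']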
