The fastest way to read binary files by bytes in Python

Question

I am writing a program that should be able to encode any type of file using the Huffman algorithm. It all works, but it is too slow on large files (at least I think it is). When I tried to open a 120 MB mp4 file to unpack it, it took about 210 s just to read the file, not to mention that it used a large chunk of memory. I thought unpacking with struct would be efficient, but it isn't. Is there a more efficient way to do this in Python? I need to read any file byte by byte and then pass it to the Huffman method as a string.

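import struct
import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    # Raw string so the backslash in the Windows path is taken literally.
    with open(r'D:\mov.mp4', 'rb') as f:
        dataL = f.read()
    data = np.zeros(len(dataL), 'uint8')

    # Slow part: one struct.unpack call per byte in a Python-level loop.
    for i in range(0, len(dataL)):
        data[i] = struct.unpack('B', dataL[i:i + 1])[0]

    # Note: the returned byte string is discarded here.
    data.tobytes()

    end = time.time()
    print("Original file read: ")
    print(end - start)

    # huffman_encode is my own function (not shown).
    encoded, table = huffman_encode(data)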

Answer 1

Your approach loads the file into a Python object, creates an empty NumPy array, and then fills that array one byte at a time with a Python-level loop.
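To make the cost concrete, here is a minimal sketch (not the asker's exact code) that puts the per-byte loop next to a single np.frombuffer call on bytes already in memory. The function names and the `raw` test data are made up for illustration; both paths should produce the same uint8 array, but only the first pays for one Python call per byte.

```python
import struct

import numpy as np


def bytes_to_array_loop(raw):
    """Per-byte conversion: one struct.unpack call for every byte."""
    out = np.zeros(len(raw), dtype=np.uint8)
    for i in range(len(raw)):
        out[i] = struct.unpack('B', raw[i:i + 1])[0]
    return out


def bytes_to_array_view(raw):
    """Single call: interpret the existing buffer as uint8, no Python loop."""
    return np.frombuffer(raw, dtype=np.uint8)


# Hypothetical stand-in for file contents.
raw = bytes(bytearray(range(256))) * 4

loop_result = bytes_to_array_loop(raw)
view_result = bytes_to_array_view(raw)
same = np.array_equal(loop_result, view_result)
```

Note that the array returned by np.frombuffer on a bytes object is read-only; call .copy() on it if you need to modify the data.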

Let's take out the middlemen:

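import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    # Read the whole file straight into a uint8 array; count=-1 means "all items".
    data = np.fromfile(r'd:\mov.mp4', dtype=np.uint8, count=-1)
    end = time.time()
    print("Original file read: ")
    print(end - start)
    # huffman_encode is your own function (not shown here).
    encoded, table = huffman_encode(data)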

What to do with 'data' depends on what type of data your huffman_encode(data) will receive. I would try to avoid using strings.
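As an illustration of staying with the array instead of a string: I am assuming here that huffman_encode starts by counting how often each byte value occurs, which is the usual first step of Huffman coding but is not shown in the question. Under that assumption the counts can come straight from the uint8 array with np.bincount. The helper name byte_frequencies and the sample input are hypothetical.

```python
import numpy as np


def byte_frequencies(data):
    """Count occurrences of each byte value 0..255 in a uint8 array."""
    return np.bincount(data, minlength=256)


# Small in-memory sample standing in for the file contents.
data = np.frombuffer(b"abracadabra", dtype=np.uint8)
freq = byte_frequencies(data)

# If a byte string really is required downstream, one call converts the array.
as_bytes = data.tobytes()
```

Here freq is a 256-entry array indexed by byte value, so freq[ord('a')] gives the count for the letter a.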

Documentation for np.fromfile is here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html

I would be interested to hear the speed differences in the comments :)
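If you want to measure it yourself, something along these lines should do. It is only a sketch: the temporary file of random bytes and the time_read helper are made up for the example, so point it at a real file to get numbers for your own data.

```python
import os
import tempfile
import time

import numpy as np


def time_read(path):
    """Return (array, seconds) for a single np.fromfile read of path."""
    start = time.perf_counter()
    arr = np.fromfile(path, dtype=np.uint8)
    return arr, time.perf_counter() - start


# Hypothetical test file: 1 MiB of random bytes.
payload = os.urandom(1 << 20)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    arr, elapsed = time_read(path)
    print("np.fromfile read %d bytes in %.4f s" % (arr.size, elapsed))
finally:
    os.remove(path)
```

A single run is noisy, and a second read usually benefits from the operating system's file cache, so repeat the measurement a few times before comparing approaches.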

1 answer

solution1
2 ACCEPTED 2015-10-30 09:10:09
