
Python: Serializing/de-serializing huge amount of data

I have a (very large) dataset: something on the order of 250,000 binary vectors, each of size 800.
The dataset resides in a .txt file (ASCII encoding), in a 'compressed representation'. Meaning, every line in that file describes a vector by listing the positions of its ones, rather than spelling out 800 characters of zeroes and ones.
For example, suppose that the i'th line in that file looks like this:

12 14 16 33 93 123 456 133

This means that the i'th vector holds the value 1 at its 12th, 14th, 16th, ..., 133rd indices, and zeroes everywhere else.
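
For illustration, decoding one such line back into a full vector is straightforward. A rough sketch (decode_line is an illustrative name, and the listed indices are assumed to be 0-based):

def decode_line(line, size=800):
    """Turn a line of space-separated indices into a list of 800 zeros/ones."""
    vec = [0] * size
    for idx in line.split():
        vec[int(idx)] = 1
    return vec

decode_line("12 14 16 33 93 123 456 133")  # 800 zeros, with ones at the listed positions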

The file's size is a little more than 30MB.

Now, since I use this data to feed a neural network, it needs some preprocessing to transform it into what the network expects: a list of size 250,000, where every element is a 20x40 matrix (a list of lists) of zeros and ones.
For example, if we rescale the problem to 4x2, this is what the final list looks like:

[[[1,0],[1,1],[0,0],[1,0]], [[0,0],[0,1],[1,0],[1,0]], ..., [[1,1],[0,1],[0,0],[1,1]]]

(only instead of 4x2 I have 20x40 matrices).

So I wrote two functions: load_data() - which parses the file and returns a list of binary lists (each of length 800), and reshape() - which reshapes each such list into a 20x40 matrix.
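
Roughly, the reshaping step does something along these lines (illustrative names, not my exact code):

def reshape_vector(vec, rows=20, cols=40):
    """Turn one flat 800-element binary list into 20 rows of 40 values."""
    return [vec[r * cols:(r + 1) * cols] for r in range(rows)]

reshape() does essentially this for every one of the 250,000 vectors.
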
Needless to say, my poor laptop works really hard when load_data() and reshape() are running. The preprocessing takes about 7-9 minutes, during which I can barely do anything else on the laptop; even minimizing the IDE window is a struggle.
Since I use this data to tune a neural net, I very often find myself killing the running process, re-tuning the network, and starting again - and every restart means another call to load_data() followed by reshape().
So, I decided to short-cut this painful process of loading the data --> transforming it to binary vectors --> reshaping it.
I want to load the data from the file, transform it to binary vectors, reshape it, and serialize it once to a file my_input.
Then, whenever I need to feed the network, I can just de-serialize the data from my_input and save myself a lot of time.
This is how I did it:

import cPickle  # Python 2; on Python 3 this would be the pickle module

input_file = open('my_input', 'wb')

print 'loading data from file...'
input_data = load_data()  # loads the data from the file and re-encodes it to binary vectors

print 'reshaping...'
reshaped_input = reshape(input_data)

print 'writing to file...'
cPickle.dump(reshaped_input, input_file, cPickle.HIGHEST_PROTOCOL)
input_file.close()
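
Reading it back in is then just the reverse:

import cPickle

with open('my_input', 'rb') as f:
    reshaped_input = cPickle.load(f)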

The problem is this:
The resulting file is huge: 1.7 GB in size. It seems the game is not worth the candle (I hope I'm using that expression right), since it takes far too long to load it back (I didn't measure exactly how long; I just tried to load it, gave up after 9-10 minutes, and killed the process).

Why is the resulting file so much bigger than the original (I'd expect it to be bigger, but not by that much)?
Is there another way to encode the data (serialization/de-serialization-wise) that will result in a smaller file and be worth my while?
Or, alternatively, if anyone can suggest a better way to speed things up (besides buying a faster computer), that would also be great.

P.S. I don't care about compatibility issues when it comes to de-serializing; the only place this data will ever be de-serialized is on my own computer.

If you were to store a bit for each value in your data, you'd end up with a 25 MB file; so your "compression" scheme is actually making your file bigger. The only advantage of your current scheme is that you get to store your data in ASCII.

Calculation:

250,000 * 800 bits = 250,000 * 100 bytes = 25,000,000 bytes = 25 MB

So just store the bit patterns manually, read them back in and go on with your calculations.
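
For example, a sketch of doing that manually with only the standard library (here vectors stands for your list of flat 800-element lists of zeros/ones):

def pack_bits(bits):
    """Pack a list of 0/1 values (length a multiple of 8) into a bytearray."""
    out = bytearray(len(bits) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (7 - i % 8)
    return out

def unpack_bits(data, nbits):
    """Inverse of pack_bits: recover the list of 0/1 values."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(nbits)]

# Writing: 100 bytes per 800-bit vector, ~25 MB for 250,000 vectors.
with open('vectors.bin', 'wb') as out_file:
    for vec in vectors:
        out_file.write(pack_bits(vec))

# Reading back:
vectors_back = []
with open('vectors.bin', 'rb') as in_file:
    while True:
        chunk = bytearray(in_file.read(100))
        if not chunk:
            break
        vectors_back.append(unpack_bits(chunk, 800))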

Edit: Looks like the path of least resistance is to use the third-party module packbits (i.e. you need to download it). You must first flatten your long list of matrices into a flat list on the fly (as an iterator), write it out as a sequence of bits (note: every 32-bit int can be "packed" with 32 values, not just one value as you suggest in the comments), then do the reverse conversion on input. List-flattening recipes are a dime a dozen (see here for a selection), but here's one with complementary unflattening code.

from itertools import zip_longest   # on Python 2, use izip_longest instead

def chunks(iterable, size):
    "chunks('abcdefg', 3) --> ('a','b','c'), ('d','e','f'), ('g',0,0)"
    return zip_longest(*[iter(iterable)]*size, fillvalue=0)

def flatten(data):
    """Convert a list of N x M matrices into a flat iterator"""
    return ( v for matrix in data for row in matrix for v in row )

def unflatten(data, n, m):
    """Convert a flat sequence (of ints) into a list of `n` by `m` matrices"""
    msize = n * m
    for chunk in chunks(data, msize):
        yield [ chunk[i:i+m] for i in range(0, msize, m) ]

If sampledata is your sample array of 4 x 2 matrices,

rt = list(unflatten(flatten(sampledata), 4, 2)) 

is a list with the same structure and values (but with tuples instead of lists for the rows). Can you fill in the rest?
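
For reference, one way to fill in the bit-writing part with nothing but the standard library (a sketch, not what packbits does; it reuses chunks() from above, and data stands for your list of 20x40 matrices):

import struct

def write_bits(bits, f):
    """Pack an iterable of 0/1 values, 32 per 32-bit int, and write them to f."""
    for word in chunks(bits, 32):
        value = 0
        for bit in word:
            value = (value << 1) | bit
        f.write(struct.pack('>I', value))

def read_bits(f):
    """Yield the 0/1 values back from a file written by write_bits."""
    while True:
        raw = f.read(4)
        if not raw:
            break
        value = struct.unpack('>I', raw)[0]
        for shift in range(31, -1, -1):
            yield (value >> shift) & 1

# 250,000 * 800 bits is an exact multiple of 32, so there are no padding bits to strip.
with open('my_input', 'wb') as f:
    write_bits(flatten(data), f)

with open('my_input', 'rb') as f:
    data_back = list(unflatten(read_bits(f), 20, 40))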
