How to convert generated id, list-of-index values tuple to a one hot encoded sparse matrix

Question

I'm trying to figure out the best way to turn my data into a numpy/scipy sparse matrix. I don't need to do any heavy computation in this format. I just need to be able to convert data from a dense, too-large-for-memory csv into something I can pass to an sklearn estimator. My theory is that the sparsified data should fit in memory.

Because all of the features are categorical, I'm using a generator to iterate over the file and the hashing trick to one hot encode everything:

def get_data(train=True):
    if train:
        path = '../originalData/train_rev1_short_short.csv'
    else:
        path = '../originalData/test_rev1_short.csv'

    it = enumerate(open(path))
    next(it)  # burn the header row (next(it) works on Python 2.6+ and 3)
    x = [0] * 27  # initialize row container
    for ix, line in it:
        for ixx, f in enumerate(line.strip().split(',')):
            # Record sample id
            if ixx == 0:
                sample_id = f

            # If this is the training data, record output class
            elif ixx == 1 and train:
                c = f

            # Use the hashing trick to one hot encode categorical features
            else:
                x[ixx] = abs(hash(str(ixx) + '_' + f)) % (2 ** 20)

        yield (sample_id, x, c) if train else (sample_id, x)

The result is rows like this:

10000222510487979663 [1, 3, 66642, 433470, 960966, ..., 802612, 319257, 80942]
10000335031004381249 [1, 2, 87543, 394759, 183945, ..., 773845, 219833, 64573]

The first value is the sample ID, and the list holds the indices of the columns that have a value of 1.
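
A note on the hashing step: on Python 3 the built-in hash() of a string is salted per interpreter process unless PYTHONHASHSEED is set, so train and test files hashed in separate runs would not map to the same columns. A minimal sketch of a deterministic replacement follows. It swaps in zlib.crc32 for hash(); the helper name stable_bucket and the example fields are made up for illustration.

```python
import zlib

N_BUCKETS = 2 ** 20  # same bucket count as in get_data


def stable_bucket(position, value, n_buckets=N_BUCKETS):
    """Map a (column position, category value) pair to a column index.

    Hypothetical helper: zlib.crc32 replaces the built-in hash() so the
    mapping is identical across interpreter runs.
    """
    key = '{}_{}'.format(position, value).encode('utf-8')
    # Mask to an unsigned 32-bit value so the result is non-negative
    # on every Python version.
    return (zlib.crc32(key) & 0xffffffff) % n_buckets


# Made-up row: sample id, output class, then two categorical fields.
row = ['10000222510487979663', '0', 'site_a', 'app_b']
indices = [stable_bucket(i, f) for i, f in enumerate(row) if i > 1]
```

The position is still mixed into the key, as in the original code, so the same category string in two different columns lands in different buckets (up to collisions).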

What is the most efficient way to turn this into a numpy/scipy sparse matrix? My only requirements are fast row-wise write/read and sklearn compatibility. Based on the scipy documentation, it seems like the CSR matrix is what I need, but I'm having some trouble figuring out how to convert the data I have while using the generator construct.

Any advice? I'm also open to alternate approaches; I'm relatively new to problems like this.

Answer 1

Your data format is almost the internal structure of a scipy.sparse.lil_matrix (list of lists). You should first generate one of those, and then call .tocsr() on it to obtain the desired csr matrix.

A small example on how to populate these:

from scipy.sparse import lil_matrix

positions = [[1, 2, 10], [], [5, 6, 2]]
data = [[1, 1, 1], [], [1, 1, 1]]

l = lil_matrix((3, 11))
l.rows = positions
l.data = data

c = l.tocsr()

Here positions corresponds to your feature indices, and data is just a list of lists of ones mirroring the structure of positions. As you can see, the attributes l.rows and l.data are real lists here, so you can append data as it comes. In that case you need to be careful with the shape, though. When scipy generates a lil_matrix from other data, it stores arrays of dtype object in these attributes, but those behave almost like lists, too.
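
As an alternative to going through lil_matrix, the CSR arrays can be accumulated directly while consuming a generator, using the csr_matrix((data, indices, indptr), shape=...) constructor. The following is a minimal sketch under some assumptions: rows_to_csr is a hypothetical helper, not part of scipy; it expects (sample_id, column_indices) pairs; and it assumes the column indices within one row are unique.

```python
import numpy as np
from scipy.sparse import csr_matrix


def rows_to_csr(rows, n_features=2 ** 20):
    """Build a CSR matrix from an iterable of (sample_id, column_indices).

    Hypothetical helper. Every listed column gets the value 1 in its row.
    Assumes the indices within a single row are unique.
    """
    ids, indices, indptr = [], [], [0]
    for sample_id, cols in rows:
        ids.append(sample_id)
        indices.extend(cols)          # copies, so a reused list is safe
        indptr.append(len(indices))   # offset where the next row starts
    data = np.ones(len(indices), dtype=np.int8)
    X = csr_matrix((data, indices, indptr), shape=(len(ids), n_features))
    return ids, X


# Small demo with made-up ids, mirroring the lil_matrix example above.
demo = [('a', [1, 2, 10]), ('b', []), ('c', [2, 5, 6])]
ids, X = rows_to_csr(demo, n_features=11)
```

Because the number of rows is only known once the generator is exhausted, the shape is derived from the collected ids rather than fixed up front. For training data, where the generator also yields the class, the labels would need to be collected in a separate list alongside ids.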
