I'm trying to figure out the best way to turn my data into a numpy/scipy sparse matrix. I don't need to do any heavy computation in this format. I just need to be able to convert data from a dense, too-large-for-memory csv to something I can pass into an sklearn estimator. My theory is that the sparse-ified data should fit in memory.
Because all of the features are categorical, I'm using a generator to iterate over the file and the hashing trick to one hot encode everything:
def get_data(train=True):
    if train:
        path = '../originalData/train_rev1_short_short.csv'
    else:
        path = '../originalData/test_rev1_short.csv'
    it = enumerate(open(path))
    next(it)  # burn the header row
    for ix, line in it:
        x = [0] * 27  # fresh row container, so yielded rows do not share one list
        for ixx, f in enumerate(line.strip().split(',')):
            # Record sample id
            if ixx == 0:
                sample_id = f
            # If this is the training data, record output class
            elif ixx == 1 and train:
                c = f
            # Use the hashing trick to one hot encode categorical features
            else:
                x[ixx] = abs(hash(str(ixx) + '_' + f)) % (2 ** 20)
        yield (sample_id, x, c) if train else (sample_id, x)
The result is rows like this:
10000222510487979663 [1, 3, 66642, 433470, 960966, ..., 802612, 319257, 80942]
10000335031004381249 [1, 2, 87543, 394759, 183945, ..., 773845, 219833, 64573]
Where the first value is the sample ID and the list is the index values of the columns that have a '1' value.
What is the most efficient way to turn this into a numpy/scipy sparse matrix? My only requirements are fast row-wise write/read and sklearn compatibility. Based on the scipy documentation, it seems like the CSR matrix is what I need, but I'm having some trouble figuring out how to convert the data I have while using the generator construct.
Any advice? Open also to alternate approaches, I'm relatively new to problems like this.
Your data format is almost the internal structure of a scipy.sparse.lil_matrix
(list of lists). You should first generate one of those, and then call .tocsr()
on it to obtain the desired csr matrix.
A small example on how to populate these:
from scipy.sparse import lil_matrix
positions = [[1, 2, 10], [], [5, 6, 2]]
data = [[1, 1, 1], [], [1, 1, 1]]
l = lil_matrix((3, 11))
l.rows = positions
l.data = data
c = l.tocsr()
Here, data is just a list of lists of ones mirroring the structure of positions, and positions would correspond to your feature indices. As you can see, the attributes l.rows and l.data are real lists here, so you can append data as it comes. In that case you need to be careful with the shape, though. When scipy generates a lil_matrix from other data, it stores arrays of dtype object instead, but those behave almost like lists, too.
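If assigning to the lil_matrix attributes directly feels fragile, an alternative is to build the three CSR arrays yourself while consuming the generator and hand them to csr_matrix in one call. The following is only a minimal sketch: it assumes each row arrives as a list of column indices and that the number of columns is 2 ** 20, as in the hashing step of the question. The names rows_to_csr and N_FEATURES are made up for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 20  # assumption: same modulus as the hashing step in the question


def rows_to_csr(rows, n_features=N_FEATURES):
    """Build a CSR matrix of ones from an iterable of per-row column-index lists."""
    indices = []  # column index of every nonzero entry, row after row
    indptr = [0]  # indptr[i]:indptr[i + 1] delimits row i inside `indices`
    for cols in rows:
        # Deduplicate so two features hashing to the same column are not summed
        cols = sorted(set(cols))
        indices.extend(cols)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.float64)
    n_rows = len(indptr) - 1
    return csr_matrix((data, indices, indptr), shape=(n_rows, n_features))


# Usage with the small example lists from the answer
X = rows_to_csr([[1, 2, 10], [], [5, 6, 2]], n_features=11)
print(X.toarray())
```

With the generator from the question you would pass only the index lists, for example rows_to_csr(x for _, x, _ in get_data()), and collect the sample ids and classes separately. Only two flat Python lists grow while the file is read, so the dense rows never need to be held in memory at once.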