
How to convert generated id, list-of-index values tuple to a one hot encoded sparse matrix

I'm trying to figure out the best way to turn my data into a numpy/scipy sparse matrix. I don't need to do any heavy computation in this format. I just need to be able to convert data from a dense, too-large-for-memory csv to something I can pass into an sklearn estimator. My theory is that the sparse-ified data should fit in memory.

Because all of the features are categorical, I'm using a generator to iterate over the file and the hashing trick to one hot encode everything:

def get_data(train=True):
    if train:  # was `traindata`, an undefined name
        path = '../originalData/train_rev1_short_short.csv'
    else:
        path = '../originalData/test_rev1_short.csv'

    it = enumerate(open(path))
    next(it)  # burn the header row (`it.next()` is Python 2 only)
    for ix, line in it:
        # Fresh row container per line; reusing one list across yields
        # would make every yielded row alias the same object
        x = [0] * 27
        for ixx, f in enumerate(line.strip().split(',')):
            # Record sample id
            if ixx == 0:
                sample_id = f

            # If this is the training data, record output class
            elif ixx == 1 and train:
                c = f

            # Use the hashing trick to one hot encode categorical features
            else:
                x[ixx] = abs(hash(str(ixx) + '_' + f)) % (2 ** 20)

        yield (sample_id, x, c) if train else (sample_id, x)
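One caveat with the built-in hash: Python 3 randomizes str hashes per process (PYTHONHASHSEED), so the bucket indices above change between runs. If reproducible indices matter (e.g. train and predict in separate processes), a deterministic hash can be substituted. This is a sketch, not part of the original code; feature_index and n_buckets are names introduced here for illustration:

```python
import hashlib

def feature_index(col_ix, value, n_buckets=2 ** 20):
    """Deterministically map a (column, value) pair to a bucket index."""
    key = ('%d_%s' % (col_ix, value)).encode('utf-8')
    # md5 is stable across processes and platforms, unlike built-in hash()
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

i = feature_index(2, 'foo')
```

The modulus plays the same role as the % (2 ** 20) in the generator: it caps the feature space at 2**20 columns, at the cost of occasional hash collisions.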

The result is rows like this:

10000222510487979663 [1, 3, 66642, 433470, 960966, ..., 802612, 319257, 80942]
10000335031004381249 [1, 2, 87543, 394759, 183945, ..., 773845, 219833, 64573]

Where the first value is the sample ID and the list contains the index values of the columns that have a '1' value.

What is the most efficient way to turn this into a numpy/scipy sparse matrix? My only requirements are fast row-wise write/read and sklearn compatibility. Based on the scipy documentation, it seems like the CSR matrix is what I need, but I'm having some trouble figuring out how to convert the data I have while using the generator construct.

Any advice? I'm also open to alternate approaches; I'm relatively new to problems like this.

Your data format is almost the internal structure of a scipy.sparse.lil_matrix (list of lists). You should first generate one of those, and then call .tocsr() on it to obtain the desired CSR matrix.

A small example of how to populate these:

from scipy.sparse import lil_matrix

positions = [[1, 2, 10], [], [5, 6, 2]]
data = [[1, 1, 1], [], [1, 1, 1]]

l = lil_matrix((3, 11))
l.rows = positions
l.data = data

c = l.tocsr()

where data is just a list of lists of ones mirroring the structure of positions, and positions corresponds to your feature indices. As you can see, the attributes l.rows and l.data are real Python lists here, so you can append data as it comes. In that case you need to be careful with the shape, though. When scipy generates these lil_matrix objects from other data, it stores the rows as arrays of dtype object, but those behave almost like lists, too.
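Putting the two pieces together, the generator's output can be streamed straight into the lil_matrix row by row and then converted. A minimal sketch, using hard-coded rows in place of the real get_data() output (the sample IDs and indices below are taken from the example rows in the question, truncated to four indices each):

```python
from scipy.sparse import lil_matrix

# Hypothetical (sample_id, index-list) pairs as get_data() would yield them
rows = [
    ('10000222510487979663', [1, 3, 66642, 433470]),
    ('10000335031004381249', [1, 2, 87543, 394759]),
]

n_features = 2 ** 20  # must match the hashing-trick modulus
m = lil_matrix((len(rows), n_features))

ids = []  # keep sample IDs alongside, since the matrix only stores row numbers
for i, (sample_id, indices) in enumerate(rows):
    ids.append(sample_id)
    m.rows[i] = sorted(set(indices))       # lil_matrix wants sorted, unique column indices
    m.data[i] = [1.0] * len(m.rows[i])     # matching list of ones

csr = m.tocsr()  # fast row slicing, accepted directly by sklearn estimators
```

Note that this requires knowing the number of rows up front (one pass over the file, or a line count); if that is awkward, an alternative is to accumulate (row, col) pairs and build a scipy.sparse.coo_matrix in one shot, then call .tocsr() on that instead.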
