How to construct a numpy array from multiple vectors with data aligned by id

I am using Python, numpy and scikit-learn. I have key/value data stored in an SQL table. I retrieve it as a list of tuples of the form [(id, value),...]. Each id appears only once in the list, and the tuples are sorted in order of ascending id. This process is repeated a few times, so that I have multiple lists of key: value pairs, such that:

dataset = []
for sample in samples:
    listOfTuplePairs = getDataFromSQL(sample)    # get a [(id, value),...] list
    dataset.append(listOfTuplePairs)

Keys may be duplicated across different samples, and each row may be of a different length. An example dataset might be:

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

It can be seen that each row is a sample, and that some ids (in this case 2) appear in multiple samples.

Problem: I wish to construct a single (possibly sparse) numpy array suitable for processing with scikit-learn. The values relating to a specific key (id) for each sample should be aligned in the same 'column' (if that is the correct terminology), such that the matrix for the above example would look as follows:

    ids =     1    2     3      4    5
          ------------------------------
dataset = [(0.13, 2.05, null, null, null),
           (null, 0.23, null, 7.35, 5.60),
           (null, 0.61, 4.45, null, null)]

As you can see, I also wish to strip the ids from the matrix (though I will need to retain a list of them so I know what the values in the matrix relate to). Each initial list of key: value pairs may contain several thousand rows, and there may be several thousand samples, so the resulting matrix may be very large. Please provide answers that consider speed (within the limits of Python), memory efficiency and code clarity.

Many, many thanks in advance for any help.

Here's a NumPy-based approach that creates a sparse coo_matrix, with memory efficiency in focus:

import numpy as np
from scipy.sparse import coo_matrix

# Construct row IDs
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

# Extract values from dataset into a NumPy array
arr = np.concatenate(dataset)

# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0],return_inverse=True)[1]

# Determine the output shape
out_shp = (row.max()+1,col.max()+1)

# Finally create a sparse matrix from the row, col indices and the second column of arr (the values)
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
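
The row-index step above is the least obvious part, so here is a minimal sketch of what it computes, next to an alternative built on np.repeat. The np.repeat variant is not part of the original answer; it is shown only because it arguably reads more directly. Both assume every sample contains at least one (id, value) pair.

```python
import numpy as np

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

lens = np.array([len(item) for item in dataset])

# Construction from the answer: put a 1 at the position where each new
# sample starts, then take the running sum to get a row index per pair.
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row_cumsum = shifts_arr.cumsum()

# Alternative (an assumption, not from the answer): repeat each sample
# index once per (id, value) pair in that sample.
row_repeat = np.repeat(np.arange(len(dataset)), lens)

print(row_cumsum.tolist())  # [0, 0, 1, 1, 1, 2, 2]
print(row_repeat.tolist())  # [0, 0, 1, 1, 1, 2, 2]
```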

Please note that if the IDs are meant to be used directly as column numbers in the output array (that is, they are integers starting at 1), you could replace the np.unique call with something like this:

col = (arr[:,0]-1).astype(int)

This skips the sorting that np.unique performs, so it should give us a good performance boost!
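
The two column-indexing options differ in more than speed, which a small sketch can make visible. The dataset below is hypothetical (widely spaced ids chosen for illustration), and np.repeat is used for the row indices as a shorthand. With np.unique there is one column per id that actually occurs; with the direct mapping there is one column for every integer up to the largest id.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical data with widely spaced ids (not from the question)
dataset = [[(10, 0.5)],
           [(10, 1.5), (1000, 2.5)]]

lens = [len(item) for item in dataset]
row = np.repeat(np.arange(len(dataset)), lens)
arr = np.concatenate(dataset)  # shape (n_pairs, 2): column 0 = id, column 1 = value

# Option 1: compact columns, one per id that occurs
ids, col_compact = np.unique(arr[:, 0], return_inverse=True)
compact = coo_matrix((arr[:, 1], (row, col_compact)),
                     shape=(len(dataset), len(ids)))

# Option 2: use id - 1 directly as the column number
col_direct = (arr[:, 0] - 1).astype(int)
direct = coo_matrix((arr[:, 1], (row, col_direct)),
                    shape=(len(dataset), int(col_direct.max()) + 1))

print(compact.shape)            # (2, 2)
print(direct.shape)             # (2, 1000)
print(compact.nnz, direct.nnz)  # 3 3
```

Both matrices store the same three values, so the sparse memory cost is similar; the difference only matters once the matrix is densified or passed to code that scales with the number of columns.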

Sample run:

In [264]: dataset = [[(1, 0.13), (2, 2.05)],
     ...:            [(2, 0.23), (4, 7.35), (5, 5.60)],
     ...:            [(2, 0.61), (3, 4.45)]]

In [265]: sp_out.todense() # Using .todense() to show output
Out[265]: 
matrix([[ 0.13,  2.05,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.23,  0.  ,  7.35,  5.6 ],
        [ 0.  ,  0.61,  4.45,  0.  ,  0.  ]])
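
The question also asks to keep the list of ids so that columns can be mapped back. A minimal sketch of one way to do that is to keep the first return value of np.unique. The helper name build_sparse is hypothetical, the conversion to CSR is an assumption made because scikit-learn estimators commonly accept that format, and every sample is assumed to be non-empty.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_sparse(dataset):
    """Hypothetical helper: return (X, ids) where ids[j] is the id of column j."""
    lens = [len(item) for item in dataset]
    # One row index per (id, value) pair
    row = np.repeat(np.arange(len(dataset)), lens)
    arr = np.concatenate(dataset)
    # ids: sorted unique ids; col: position of each pair's id within ids
    ids, col = np.unique(arr[:, 0], return_inverse=True)
    X = coo_matrix((arr[:, 1], (row, col)), shape=(len(dataset), len(ids)))
    return X.tocsr(), ids.astype(int)

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

X, ids = build_sparse(dataset)
print(ids.tolist())  # [1, 2, 3, 4, 5]

# Look up the column that holds id 4 (ids is sorted, so searchsorted works)
j = int(np.searchsorted(ids, 4))
print(X[:, j].toarray().ravel().tolist())  # [0.0, 7.35, 0.0]
```

Note that a sparse matrix treats absent entries as 0, not as missing values, which may or may not be what the downstream model should see.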

You can convert each element in the dataset to a dictionary and then build a pandas data frame, which gives a result close to the desired output. If a 2D numpy array is desired, the to_numpy() method converts the data frame to a numpy array (the as_matrix() method used in older code was removed in pandas 1.0):

import pandas as pd
# Sort the columns by id so they appear in ascending order
pd.DataFrame([dict(x) for x in dataset]).sort_index(axis=1).to_numpy()

# array([[ 0.13,  2.05,   nan,   nan,   nan],
#        [  nan,  0.23,   nan,  7.35,  5.6 ],
#        [  nan,  0.61,  4.45,   nan,   nan]])
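
With the pandas route, the ids stay available as the column labels of the data frame, so they can be stripped from the matrix and still kept alongside it. A minimal sketch, assuming a pandas version that provides to_numpy() (0.24 or later):

```python
import numpy as np
import pandas as pd

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

# One dict per sample; ids absent from a sample become NaN
df = pd.DataFrame([dict(sample) for sample in dataset]).sort_index(axis=1)

ids = df.columns.to_numpy()  # ids[j] is the id of column j
X = df.to_numpy()            # dense 2D float array with NaN for missing values

print(ids.tolist())            # [1, 2, 3, 4, 5]
print(X.shape)                 # (3, 5)
print(int(np.isnan(X).sum()))  # 8
```

This result is dense, so for several thousand samples and several thousand ids it will use far more memory than the sparse approach.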
