
How to efficiently create and iterate through a large list of lists in Python?

I have my data as such:

data = {'x':Counter({'a':1,'b':45}), 'y':Counter({'b':1, 'c':212})}

where my labels are the keys of the data and the keys of the inner dictionaries are features:

all_features = ['a','b','c']
all_labels = ['x','y']

I need to create a list of lists like this:

[[data[label][feat] for feat in all_features] for label in all_labels]

[out]:

[[1, 45, 0], [0, 1, 212]]

My len(all_features) is ~5,000,000 and len(all_labels) is ~100,000.

The ultimate purpose is to create a scipy sparse matrix, e.g.:

from collections import Counter
from scipy.sparse import csc_matrix
import numpy as np


all_features = ['a','b','c']
all_labels = ['x','y']

csc_matrix(np.array([[data[label][feat] for feat in all_features] for label in all_labels]))

but looping through such a large list of lists is rather inefficient.

So how can I loop through the large list of lists efficiently?

Is there another way to create the scipy matrix from the data without looping through all features and labels?

Converting a dictionary of dictionaries into a numpy or scipy array is, as you are experiencing, not too much fun. If you know all_features and all_labels beforehand, you are probably better off using a scipy sparse COO matrix from the start to keep your counts.
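For illustration, a rough sketch of going straight from the dict of Counters to COO format via precomputed index maps (the lab_idx and feat_idx dictionaries are hypothetical names, not part of the question):

from scipy.sparse import coo_matrix

# hypothetical index maps, built once from the known labels/features
lab_idx = dict((lab, i) for i, lab in enumerate(all_labels))
feat_idx = dict((feat, j) for j, feat in enumerate(all_features))

rows, cols, vals = [], [], []
for label, counter in data.iteritems():
    for feat, n in counter.iteritems():
        rows.append(lab_idx[label])
        cols.append(feat_idx[feat])
        vals.append(n)

sps = coo_matrix((vals, (rows, cols)),
                 shape=(len(all_labels), len(all_features)))

The rest of this answer takes a different route that avoids Python-level dictionary lookups for the indices.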

Whether that is possible or not, you will want to keep your lists of features and labels in sorted order, to speed up lookups. So I am going to assume that the following doesn't change either array:

all_features = np.array(all_features)
all_labels = np.array(all_labels)
all_features.sort()
all_labels.sort()

Let's extract the labels in data in the order they are stored in the dictionary, and see where in all_labels each item falls:

labels = np.fromiter(data.iterkeys(), all_labels.dtype, len(data))
label_idx = np.searchsorted(all_labels, labels)
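With the toy data above, and assuming the dictionary happens to yield 'x' before 'y' (Python 2 dicts guarantee no order), this gives:

>>> labels
array(['x', 'y'], dtype='<S1')
>>> label_idx
array([0, 1])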

Now let's count how many features each label has, and compute from it the number of non-zero items there will be in your sparse array:

label_features = np.fromiter((len(c) for c in data.itervalues()), np.intp,
                             len(data))
indptr = np.concatenate(([0], np.cumsum(label_features)))
nnz = indptr[-1]
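For the toy data, both labels hold two features, so:

>>> label_features
array([2, 2])
>>> indptr
array([0, 2, 4])
>>> nnz
4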

Now, we extract the features for each label, and their corresponding counts:

import itertools
features_it = itertools.chain(*(c.iterkeys() for c in data.itervalues()))
features = np.fromiter(features_it, all_features.dtype, nnz)
feature_idx = np.searchsorted(all_features, features)
counts_it = itertools.chain(*(c.itervalues() for c in data.itervalues()))
counts = np.fromiter(counts_it, np.intp, nnz)
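Again for the toy data, and assuming each Counter happens to yield its keys in alphabetical order (neither dicts nor Counters guarantee any order), this might give:

>>> features
array(['a', 'b', 'b', 'c'], dtype='<S1')
>>> feature_idx
array([0, 1, 1, 2])
>>> counts
array([  1,  45,   1, 212])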

With what we have, we can create a CSR matrix directly, with labels as rows and features as columns:

from scipy.sparse import csr_matrix

sps_data = csr_matrix((counts, feature_idx, indptr),
                      shape=(len(all_labels), len(all_features)))

The only issue is that the rows of this sparse array are not in the order of all_labels, but in the order they came up when iterating over data. But we have label_idx telling us where each label ended up, and we can rearrange the rows by doing:

sps_data = sps_data[np.argsort(label_idx)]

Yes, it is messy, confusing, and probably not very fast, but it works, and it will be much more memory efficient than what you proposed in your question:

>>> sps_data.A
array([[  1,  45,   0],
       [  0,   1, 212]], dtype=int64)
>>> all_labels
array(['x', 'y'], 
      dtype='<S1')
>>> all_features
array(['a', 'b', 'c'], 
      dtype='<S1')

The dataset is quite large, so I don't think it is practical to construct a temporary dense numpy array (with 32-bit integers, a 1e5 x 5e6 matrix would require about 2 terabytes of memory: 1e5 * 5e6 * 4 bytes is roughly 2e12 bytes).

I assume you know an upper bound for the number of features (the columns).

The code could look like:

import scipy.sparse

n_rows = len(data)
max_col = int(5e6)   # upper bound on the number of features
temp_sparse = scipy.sparse.lil_matrix((n_rows, max_col), dtype='int')

for i, (label, counter) in enumerate(data.iteritems()):
    for feat, n in counter.iteritems():
        j = feature_pos[feat]   # column index of this feature
        temp_sparse[i, j] = n

csc = temp_sparse.tocsc()

Where feature_pos returns the column index of a feature. If it turns out to be impractical to use a dictionary to store the indices of 5 million features, an on-disk database should do. The dictionary could be created online, so prior knowledge of all the features is not necessary.
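A minimal sketch of building such an index online (feature_pos and col_of are hypothetical names, not part of the question):

feature_pos = {}

def col_of(feat):
    # assign the next free column the first time a feature is seen
    if feat not in feature_pos:
        feature_pos[feat] = len(feature_pos)
    return feature_pos[feat]

The loop above would then use j = col_of(feat) instead of a plain dictionary lookup.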

Iterating through 100,000 labels (the rows) would take a reasonable amount of time, so I think this solution could work if the dataset is sparse enough. Good luck!

Is there another way to create the scipy matrix from the data without looping through all features and labels?

I don't think there is any shortcut that reduces the total number of lookups. You're starting with a dictionary of Counters (a dict subclass), so both levels of nesting are unordered collections. The only way to put them back in the required order is to do a data[label][feat] lookup for every data point.

You can cut the time roughly in half by making sure the data[label] lookup is only done once per label:

>>> counters = [data[label] for label in all_labels]
>>> [[counter[feat] for feat in all_features] for counter in counters]
[[1, 45, 0], [0, 1, 212]]

You can also try speeding up the running time by using map() instead of a list comprehension (mapping can take advantage of the internal length_hint to pre-size the result array):

>>> [map(counter.__getitem__, all_features) for counter in counters]
[[1, 45, 0], [0, 1, 212]]

Lastly, be sure to run the code inside a function (local variable lookups in CPython are faster than global variable lookups):

def f(data, all_features, all_labels):
    counters = [data[label] for label in all_labels]
    return [map(counter.__getitem__, all_features) for counter in counters]
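For example, with the toy data from the question:

>>> f(data, ['a', 'b', 'c'], ['x', 'y'])
[[1, 45, 0], [0, 1, 212]]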
