Is there a simpler and faster way to get a dict containing the indexes of equal elements in a list or a numpy array?
Description:
I have a large array of simple integers (positive and not large) such as 1, 2, ..., etc. For example: [1, 1, 2, 2, 1, 2]. I want to get a dict that uses each distinct value from the list as a key, and the list of indexes of that value as the corresponding dict value.
Question:
Is there a simpler and faster way to get the expected results in Python? (The array can be a list or a numpy array.)
Code:
a = [1, 1, 2, 2, 1, 2]
results = indexes_of_same_elements(a)
print(results)
Expected results:
{1:[0, 1, 4], 2:[2, 3, 5]}
You can avoid explicit iteration here by using vectorized methods, in particular np.unique + np.argsort:
idx = np.argsort(a)
el, c = np.unique(a, return_counts=True)
out = dict(zip(el, np.split(idx, c.cumsum()[:-1])))
{1: array([0, 1, 4], dtype=int64), 2: array([2, 3, 5], dtype=int64)}
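One caveat with the approach above: np.argsort defaults to kind='quicksort', which is not a stable sort, so on larger inputs the indexes inside each group are not guaranteed to come out in ascending order. The following is a minimal sketch, assuming you want ascending indexes and plain lists as in the expected results. The function name is borrowed from the question, and kind="stable" plus the list conversion are the only changes.

```python
import numpy as np

def indexes_of_same_elements(a):
    """Map each distinct value in `a` to the ascending list of its indexes."""
    a = np.asarray(a)
    # A stable sort keeps equal elements in their original order,
    # so the indexes within each group stay ascending.
    idx = np.argsort(a, kind="stable")
    el, c = np.unique(a, return_counts=True)
    # Cut the sorted index array at the cumulative group sizes.
    groups = np.split(idx, c.cumsum()[:-1])
    # Convert keys and values to plain Python objects (an assumption:
    # the question's expected output uses int keys and list values).
    return {k: g.tolist() for k, g in zip(el.tolist(), groups)}

print(indexes_of_same_elements([1, 1, 2, 2, 1, 2]))
# {1: [0, 1, 4], 2: [2, 3, 5]}
```

The conversion to lists costs extra time on large inputs, so drop it if arrays are acceptable as dict values.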
Performance
a = np.random.randint(1, 100, 10000)
In [183]: %%timeit
     ...: idx = np.argsort(a)
     ...: el, c = np.unique(a, return_counts=True)
     ...: dict(zip(el, np.split(idx, c.cumsum()[:-1])))
     ...:
897 µs ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [184]: %%timeit
     ...: results = {}
     ...: for i, k in enumerate(a):
     ...:     results.setdefault(k, []).append(i)
     ...:
2.61 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is pretty trivial to construct the dict with a plain loop:
In []:
results = {}
for i, k in enumerate(a):
    results.setdefault(k, []).append(i)  # str(k) if you really need the key to be a str
print(results)
Out[]:
{1: [0, 1, 4], 2: [2, 3, 5]}
You could also use results = collections.defaultdict(list) and then results[k].append(i) instead of results.setdefault(k, []).append(i).
We can exploit the fact that the elements are "simple" (i.e. nonnegative and not too large?) integers.
The trick is to construct a sparse matrix with just one stored element per row, placed in the column given by that row's value, and then to convert it to a column-wise representation. Each column then lists the rows (i.e. the indexes) holding that value. This is typically faster than argsort because the conversion is O(M + N + nnz) if the sparse matrix is MxN with nnz nonzeros.
import numpy as np
from scipy import sparse

def use_sprsm():
    # CSR matrix with one entry per row: row i has its entry in column a[i]
    x = sparse.csr_matrix((a, a, np.arange(a.size+1))).tocsc()
    # columns that actually contain entries, i.e. the distinct values in a
    idx, = np.where(x.indptr[:-1] != x.indptr[1:])
    return {i: grp for i, grp in zip(idx, np.split(x.indices, x.indptr[idx[1:]]))}

# for comparison
def use_asort():
    idx = np.argsort(a)
    el, c = np.unique(a, return_counts=True)
    return dict(zip(el, np.split(idx, c.cumsum()[:-1])))
Sample run:
>>> a = np.random.randint(0, 100, (10_000,))
>>>
# sanity check, note that `use_sprsm` returns sorted indices
>>> for k, v in use_asort().items():
...     assert np.array_equal(np.sort(v), use_sprsm()[k])
...
>>> from timeit import timeit
>>> timeit(use_asort, number=1000)
0.8930604780325666
>>> timeit(use_sprsm, number=1000)
0.38419671391602606
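To make the sparse-matrix trick more concrete, here is a small worked sketch on the array from the question. It follows the same steps as use_sprsm, with one assumption swapped in for readability: the stored values are ones rather than the elements of a, since only the matrix structure (indptr and indices) matters.

```python
import numpy as np
from scipy import sparse

a = np.array([1, 1, 2, 2, 1, 2])

# CSR layout with exactly one stored entry per row:
# row i holds its entry in column a[i]. The data values are irrelevant
# (ones are an assumption made here for readability).
data = np.ones(a.size, dtype=np.int8)
x = sparse.csr_matrix((data, a, np.arange(a.size + 1)))

# After conversion to CSC, column v lists the rows i where a[i] == v.
xc = x.tocsc()
print(xc.indptr.tolist())   # [0, 0, 3, 6]: column 0 is empty
print(xc.indices.tolist())  # [0, 1, 4, 2, 3, 5]

# Keep only the non-empty columns and cut the row indexes at their starts.
nonempty, = np.where(xc.indptr[:-1] != xc.indptr[1:])
groups = np.split(xc.indices, xc.indptr[nonempty[1:]])
result = {int(k): g.tolist() for k, g in zip(nonempty, groups)}
print(result)  # {1: [0, 1, 4], 2: [2, 3, 5]}
```

The matrix has one column for every integer up to the maximum value in a, which is why the approach relies on the values being nonnegative and not too large.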