
Order of indexes in a Numpy multidimensional array

For example, say I'm simulating a bunch of particles doing something over time, and I have a multidimensional array called particles with these indexes:

  • The x/y/z coordinates of the particle (of length a, which is 3 for a 3d space)
  • The index of the individual particle (of length b)
  • The index of the time step it's on (of length c)

Is it better to construct the array such that particles.shape == (a, b, c) or particles.shape == (c, b, a)?

I'm more interested in convention than efficiency: Numpy arrays can be set up in either C-style (last index varies most rapidly) or Fortran-style (first index varies most rapidly), so it can efficiently support either setup. I also realize I can use transpose to put the indexes in any order I need, but I'd like to minimize that.
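To make that concrete, here's a small sketch (with made-up sizes a=3, b=4, c=5) showing that the same logical shape can live in either memory order; the strides show which axis is contiguous:

```python
import numpy as np

# Hypothetical sizes: a=3 spatial coords, b=4 particles, c=5 time steps.
a, b, c = 3, 4, 5

# Same logical shape, two different memory layouts.
arr_c = np.zeros((a, b, c), order='C')  # last index varies fastest in memory
arr_f = np.zeros((a, b, c), order='F')  # first index varies fastest in memory

# Strides (bytes to step along each axis) reveal the layout (8-byte floats).
print(arr_c.strides)  # (160, 40, 8): stepping the last axis moves 8 bytes
print(arr_f.strides)  # (8, 24, 96): stepping the first axis moves 8 bytes
```

Either way the indexing syntax is identical; only the memory traversal order differs.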

I started to research this myself and found support for both ways:

Pro-(c,b,a):

  • By default, Numpy uses C-style arrays where the last index is the fastest-varying.
  • Most of the vector algebra functions (inner, cross, etc.) act on the last index. (dot acts on the last axis of one argument and the second-to-last of the other.)
  • The matplotlib collection objects (LineCollection, PolyCollection) expect arrays with the spatial coordinates in the last axis.
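As a quick illustration of the coords-last point (sizes here are made up), np.cross operates on the last axis by default, so a (b, 3) array of vectors works with no axis argument:

```python
import numpy as np

# Hypothetical data: velocity and field vectors for b=4 particles,
# with the x/y/z coordinates in the last axis (the (c, b, a) convention).
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 3))
bfield = rng.standard_normal((4, 3))

# np.cross acts on the last axis by default: one cross product per particle.
force = np.cross(v, bfield)
print(force.shape)  # (4, 3)
```

With coords-first data you would instead have to pass axisa/axisb/axisc explicitly.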

Pro-(a,b,c):

  • If I were to use meshgrid and mgrid to produce a set of points, they would put the spatial axis first. For instance, np.mgrid[0:5,0:5,0:5].shape == (3,5,5,5). I realize these functions are mostly intended for integer array indexing, but it's not uncommon to use them to generate a grid of points.
  • The matplotlib scatter and plot functions split out their arguments, so they're agnostic to the shape of the array, but ax.plot3d(particles[0], particles[1], particles[2]) is shorter to type than the version with particles[..., 0].
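A small sketch of the coords-first convenience (sizes are made up): unpacking along the first axis gives each spatial component in one step, where the coords-last layout needs trailing-axis indexing:

```python
import numpy as np

# Hypothetical (a, b, c) layout: 3 coords, 4 particles, 5 time steps.
particles = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)

# Coords-first: each spatial component is a single leading index,
# so the array unpacks along axis 0.
x, y, z = particles
print(x.shape)  # (4, 5)

# The coords-last equivalent needs trailing-axis indexing:
particles_last = particles.transpose(1, 2, 0)   # shape (4, 5, 3)
x2 = particles_last[..., 0]
print(np.array_equal(x, x2))  # True
```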

In general it appears that there are two different conventions in existence (probably due to historical differences between C and Fortran), and it's not clear which is more common in the Numpy community, or more appropriate for what I'm doing.

Conventions for something like this have much more to do with particular file-formats than anything else, in my experience. However, there's a quick way to answer which one is likely to be best for what you're doing:

If you have to iterate over an axis, which one are you most likely to iterate over? In other words, which of these is most likely:

# a first
for dimension in particles:
    ...

# b first
for particle in particles:
    ...

# c first
for timestep in particles:
    ...

As far as efficiency goes, this assumes C-order, but that's actually irrelevant here. At the python level, access to numpy arrays is treated as C-ordered regardless of the memory layout. (You always iterate over the first axis, even if that's not the "most contiguous" axis in memory.)
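A quick check of that claim: iterating a Fortran-ordered array still walks the first axis, exactly like its C-ordered copy:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # C-ordered
f = np.asfortranarray(a)          # same values, Fortran memory layout

# Iteration is over the first axis in both cases -- layout is irrelevant.
rows_c = [row.tolist() for row in a]
rows_f = [row.tolist() for row in f]
print(rows_c == rows_f)  # True: both yield [[0, 1, 2], [3, 4, 5]]
```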

Of course, there are many situations where you should avoid directly iterating over numpy arrays in this manner. Nonetheless, this is the way you should think about it, particularly when it comes to on-disk file structures. Make your most common use case the fastest/easiest.

If nothing else, hopefully this gives you a useful way to think about the question.

Another bias is that when a new dimension has to be added, the numpy preference is to do so on the left. That is, broadcasting behaves as if x[None, ...] were automatic:

np.array([x,y,z])   # produces a (3,...) array

np.ones((3,2)) + np.ones((1,2,10)) # error
np.ones((3,2,1)) + np.ones((2,10))  # (3,2,10)

But I don't see how this front-first broadcasting favors one position or the other for the x/y/z coordinate.

While np.dot uses a last/second-to-last convention, np.tensordot and np.einsum are much more general.
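For example (a made-up rotation applied to a coords-first array), einsum and tensordot let you contract over the first axis just as naturally as the last:

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.standard_normal((3, 3))      # hypothetical 3x3 transform
pts = rng.standard_normal((3, 4, 5)) # coords-first (a, b, c) array

# einsum contracts over the *first* axis of pts as easily as the last:
rotated = np.einsum('ij,jbc->ibc', R, pts)
print(rotated.shape)  # (3, 4, 5)

# tensordot over explicitly chosen axes gives the same result:
same = np.tensordot(R, pts, axes=([1], [0]))
print(np.allclose(rotated, same))  # True
```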


Apocheir points out that doing a reduction on the last axis may require adding a newaxis back, e.g.

 x / np.linalg.norm(x,axis=0)   # automatic newaxis at beginning
 x / np.linalg.norm(x,axis=-1)[...,np.newaxis]  # explicit newaxis

For small x, this explicit newaxis adds measurable execution time. But for large x, the 2nd calculation is faster. I think that's because reduction on the last axis is faster - that's the axis that changes fastest (for order='C').

A number of the builtin reduction methods have a keepdims parameter to facilitate broadcasting in uses like this (e.g. sum, mean).
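A short sketch of the keepdims alternative to the explicit newaxis above (np.linalg.norm also accepts keepdims):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 3))  # b=4 particles, coords in the last axis

# keepdims leaves a size-1 axis in place, so the result broadcasts
# against x without an explicit np.newaxis:
norms = np.linalg.norm(x, axis=-1, keepdims=True)  # shape (4, 1)
unit = x / norms
print(np.allclose(np.linalg.norm(unit, axis=-1), 1.0))  # True
```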
