简体繁体 English

Numpy阵列的维数选择

[英]Choice of Dimension on Numpy Arrays

原文 2014-01-28 22:11:11 0 1 python/ arrays/ memory/ numpy/ multidimensional-array

I have a dataset I wish to analyze. 我有一个我想分析的数据集。 It consists of 它包括

measurements, total number m , which is roughly 2 000 000. 测量，总数m ，大约为2 000 000。
each measurement contains v variables. 每个测量包含v变量。 (About 10 in this case) （在这种情况下大约10）

I can name each variable (foo, bar, etc) and choose each of them a datatype , such as uint32. 我可以命名每个变量（foo，bar等），并为每个变量选择一种数据类型，例如uint32。

I want to calculate statistics (such as the mean value of foo ) and draw plots (such as scatterplot of foo vs bar ). 我想计算统计数据（例如foo的平均值）和绘制图（例如foo与bar的散点图）。

Which representation should I choose for my data (or does it even matter)? 我应该为我的数据选择哪种表示（或者甚至是否重要）？

an array of m rows and v columns or m行和v列的数组或
an array of v rows and m columns v行和m列的数组

I would generally expect better performance if each variable is assigned a continuous block of memory, but will be happy to be proven wrong if someone provides a clear answer. 如果为每个变量分配一个连续的内存块，我通常会期望更好的性能，但如果有人提供了明确的答案，我将很高兴被证明是错误的。

Bonus question: Best way to iteratively build the array? 奖金问题：迭代构建阵列的最佳方法是什么？

1 个解决方案

First things first: Remember that premature optimization is the root of all evil. 首先要做的事情是：记住过早优化是万恶之源。 You can always use the timeit module if you suspect something is slow. 如果您怀疑某些事情很慢，您可以随时使用timeit模块。

As for your question, I store my data such that the measurements are indexed by rows and the dimensions are indexed by columns. 至于你的问题，我存储我的数据，以便测量按行索引，尺寸按列索引。 This way the measurements themselves will (probably*) be continuous is memory. 这样，测量本身（可能是*）是连续的记忆。 But the real reason is if I have measurements M.shape = (m, v) then M[n] will access the nth measurement, and that lends itself well to tidy code. 但真正的原因是如果我有测量M.shape =（m，v）那么M [n]将访问第n次测量，这很适合整洁的代码。

*a numpy array may not be continuous is memory if it is built oddly. *如果奇怪地构建了numpy数组可能不是连续的内存。 np.ascontinuous will fix this. np.ascontinuous将解决这个问题。

BONUS: 奖金：

If you are iteravely building an array it is best not to use a numpy array to begin with. 如果你正在迭代地构建一个数组，最好不要使用numpy数组开始。 They are not meant to be resized easilly. 它们并不意味着易于调整大小。 I would store all your data in a python list and use the append function to add new measurement. 我会将所有数据存储在python列表中，并使用append函数添加新的测量。 Then when you need to put the data in numpy you can call np.array on the python list and it will copy the data so efficient operations can be performed on it. 然后，当您需要将数据放入numpy时，您可以在python列表上调用np.array，它将复制数据，以便对其执行有效的操作。 But for dynamic storage go with python lists. 但是对于动态存储，请使用python列表。 They are very efficient. 他们非常有效率。