Convert a Pandas DataFrame to a multidimensional ndarray
I have a DataFrame with columns for the x, y, z coordinates and the value at this position, and I want to convert this to a 3-dimensional ndarray.
To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).
Just a simple example:
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
This should result in the ndarray:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
For two dimensions, this is easy:
array = df.pivot_table(index="y", columns="x", values="value").to_numpy()  # .as_matrix() in pandas < 1.0
However, this method cannot be applied to three or more dimensions.
Could you give me some suggestions?
Bonus points if this also works for more than three dimensions, handles multiply-defined values (by taking the average), and ensures that all x, y, z coordinates are consecutive (by inserting rows/columns of NaN when a coordinate is missing).
EDIT: Some more explanations:
I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. Then I round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations, but they need to be in the correct order.
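The rounding step described above can be sketched like this (the sample data and the 0.1 m precision are made up for illustration, not from the question):

```python
import pandas as pd

# Hypothetical raw measurements; 0.1 m precision as in the example above.
df = pd.DataFrame({'x': [0.12, 0.13, 0.31],
                   'y': [0.08, 0.12, 0.11],
                   'value': [1.0, 2.0, 3.0]})

precision = 0.1
for c in ('x', 'y'):
    # Snap each coordinate onto an integer grid of the given precision.
    df[c] = (df[c] / precision).round().astype(int)

# Average the measurements that land on the same grid cell.
averaged = df.groupby(['x', 'y'])['value'].mean()
```

The integer grid indices produced this way can then be fed into any of the reshaping approaches in the answers below.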
EDIT: I just ran a quick performance test:
jakevdp's solution takes 1.598 s, Divakar's solution takes 7.405 s, JohnE's solution takes 7.867 s, and Wen's solution takes 6.286 s to complete.
You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into nD Numpy array:
grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)

# fill it using NumPy's advanced indexing
# (MultiIndex.labels was renamed to .codes in pandas 0.24)
arr[tuple(grouped.index.codes)] = grouped.values.flat

print(arr)
# [[[ 1.  2. nan]
#   [ 3. nan  4.]]
#
#  [[ 5.  6.  7.]
#   [ 8.  9. nan]]]
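For the bonus requirement (entirely missing coordinates should appear as NaN slices), one sketch is to reindex the grouped series onto a full product of consecutive integer coordinates before reshaping; the sample data here is made up to have a gap at x == 2:

```python
import numpy as np
import pandas as pd

# x == 2 never occurs, so the output should get a NaN column for it.
df = pd.DataFrame({'x': [1, 3], 'y': [1, 2], 'z': [1, 1],
                   'value': [10.0, 20.0]})

grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# Build consecutive integer ranges per level and reindex onto their product;
# missing combinations become NaN automatically.
levels = [np.arange(lvl.min(), lvl.max() + 1) for lvl in grouped.index.levels]
full = pd.MultiIndex.from_product(levels, names=grouped.index.names)
arr = grouped.reindex(full).values.reshape([len(lvl) for lvl in levels])
```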
Here's one NumPy approach -
import numpy as np

def dataframe_to_array_averaged(df):
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)                    # shift coordinates to start at 0
    out_shp = arr.max(0) + 1             # output shape per axis
    L = np.prod(out_shp)
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)   # linearized grid indices
    # sum / count per cell; cells with no samples give 0/0 = NaN
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)
Note that this shows a warning, because places with no x,y,z triplets have zero counts, and hence the average there is 0/0 = NaN. Since that's the expected output for those places, you can ignore the warning. To avoid it, we can employ indexing, as discussed in the second method (alternative method).
Sample run -
In [106]: df
Out[106]:
   value  x  y  z
0      1  1  1  1   # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1   # <=== this is repeated
In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]:
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])
Alternative method
To avoid the warning, an alternative way would be like so -
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)
unq_ids, count = np.unique(ids, return_counts=True)
out.flat[unq_ids] = sums[unq_ids] / count
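Put together as a self-contained variant (this wiring is my sketch; it assumes integer coordinate columns named as in the question):

```python
import numpy as np
import pandas as pd

def dataframe_to_array_averaged_no_warning(df):
    # Same linear-index trick as above, but divide only where counts are nonzero.
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = tuple(arr.max(0) + 1)
    ids = np.ravel_multi_index(arr.T, out_shp)
    sums = np.bincount(ids, df['value'].values)
    unq_ids, count = np.unique(ids, return_counts=True)
    out = np.full(out_shp, np.nan)          # NaN everywhere a cell is empty
    out.flat[unq_ids] = sums[unq_ids] / count
    return out
```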
Another solution is to use the xarray package:
import pandas as pd
import xarray as xr
df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)
Output:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
Note that the xrTensor object is very handy, since xarray's DataArrays contain the labels, so you may just go on working with that object rather than pulling out the ndarray:
print(xrTensor)
Output:
<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
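Since the labels travel with the array, you can for example look values up by coordinate rather than by position (a small sketch using xarray's standard `sel` API):

```python
import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
pivoted = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(pivoted).unstack("dim_0")

# Label-based lookup: the value measured at x=3, y=2, z=1.
v = xrTensor.sel(dim_1='value', x=3, y=2, z=1).item()
```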
We can use stack:
np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))
Out[451]:
array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])
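The target shape (2, 2, 3) is hard-coded in the one-liner above; it can be derived from the index levels instead (a sketch of the same approach, generalized):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2],
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

grouped = df.groupby(['z', 'y', 'x'])['value'].mean()
# One axis length per index level, in (z, y, x) order.
shape = tuple(len(lvl) for lvl in grouped.index.levels)
# Round-trip through unstack/stack to fill missing combinations with NaN.
arr = np.reshape(
    grouped.unstack([1, 2]).stack([0, 1], dropna=False).values,
    shape)
```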