
Convert a Pandas DataFrame to a multidimensional ndarray

I have a DataFrame with columns for the x, y, z coordinates and the value at this position, and I want to convert this to a 3-dimensional ndarray.

To make things more complicated, not all values exist in the DataFrame (these can just be replaced by NaN in the ndarray).

Just a simple example:

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2], 
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

Should result in the ndarray:

array([[[  1.,   2.,  nan],
        [  3.,  nan,   4.]],

       [[  5.,   6.,   7.],
        [  8.,   9.,  nan]]])

For two dimensions, this is easy:

# .as_matrix() was removed from pandas; .to_numpy() is the replacement
array = df.pivot_table(index="y", columns="x", values="value").to_numpy()

However, this method cannot be applied to three or more dimensions.

Could you give me some suggestions?

Bonus points if this also works for more than three dimensions, handles multiply defined values (by taking the average) and ensures that all x, y, z coordinates are consecutive (by inserting rows/columns of NaN when a coordinate is missing).

EDIT: Some more explanations:

I read data from a CSV file which has columns for the x, y, z coordinates, optionally the frequency, and the measurement value at this point and frequency. I then round the coordinates to a specified precision (e.g. 0.1 m) and want to get an ndarray which contains the averaged measurement values at each (rounded) coordinate. The indices of the values do not need to coincide with the locations; however, they need to be in the correct order.
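
For the rounding step, here is a minimal sketch (assuming a precision of 0.1 m; the file name, the precision variable and overwriting the coordinate columns in place are just for illustration, not from the question):

import pandas as pd

df = pd.read_csv('measurements.csv')   # hypothetical file name
precision = 0.1                        # hypothetical rounding step in metres

# snap each coordinate to the nearest multiple of `precision`;
# duplicates created this way are averaged later by groupby(...).mean()
df[['x', 'y', 'z']] = (df[['x', 'y', 'z']] / precision).round() * precision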

EDIT: I just ran a quick performance test:

The solution of jakevdp takes 1.598s, Divikar's solution takes 7.405s, JohnE's solution takes 7.867s and Wen's solution takes 6.286s to complete.

You can use a groupby followed by the approach from Transform Pandas DataFrame with n-level hierarchical index into nD Numpy array:

import numpy as np

grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, grouped.index.levels))
arr = np.full(shape, np.nan)

# fill it using NumPy's advanced indexing
# (MultiIndex.labels was renamed to .codes in newer pandas; wrap in a tuple for NumPy)
arr[tuple(grouped.index.codes)] = grouped.values.flat

print(arr)
# [[[  1.   2.  nan]
#   [  3.  nan   4.]]
# 
#  [[  5.   6.   7.]
#   [  8.   9.  nan]]]
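
If you also want the "consecutive coordinates" behaviour asked for in the question (missing coordinates showing up as NaN rows/columns), one possible variation is the following sketch, assuming integer coordinates with step 1 as in the example; the levels and full_index names are just for illustration:

import numpy as np
import pandas as pd

grouped = df.groupby(['z', 'y', 'x'])['value'].mean()

# reindex against the full product of consecutive ranges per axis,
# so missing coordinates become NaN entries instead of being skipped
levels = [np.arange(df[c].min(), df[c].max() + 1) for c in ('z', 'y', 'x')]
full_index = pd.MultiIndex.from_product(levels, names=['z', 'y', 'x'])
arr = grouped.reindex(full_index).to_numpy().reshape([len(lvl) for lvl in levels])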

Here's one NumPy approach -

import numpy as np

def dataframe_to_array_averaged(df):
    # integer grid positions, shifted so each axis starts at zero
    arr = df[['z','y','x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0)+1

    L = np.prod(out_shp)

    # one flat (linear) index per row of the DataFrame
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)

    # per-cell sum divided by per-cell count gives the average; empty cells give 0/0
    avgs = np.bincount(ids, val, minlength=L)/np.bincount(ids, minlength=L)
    return avgs.reshape(out_shp)

Note that this shows a warning, because places with no x, y, z triplets have zero counts, so the average there becomes 0/0 = NaN. Since that is the expected output for those places, you can ignore the warning. To avoid it altogether, we can employ indexing instead, as discussed in the second method (Alternative method).
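
If you want to keep this first method but silence just that division warning, one small option (a sketch wrapping only the division line) is NumPy's errstate context manager:

with np.errstate(divide='ignore', invalid='ignore'):
    avgs = np.bincount(ids, val, minlength=L) / np.bincount(ids, minlength=L)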

Sample run -

In [106]: df
Out[106]: 
   value  x  y  z
0      1  1  1  1  # <=== this is repeated
1      2  2  1  1
2      3  1  2  1
3      4  3  2  1
4      5  1  1  2
5      6  2  1  2
6      7  3  1  2
7      8  1  2  2
8      9  2  2  2
9      4  1  1  1  # <=== this is repeated

In [107]: dataframe_to_array_averaged(df)
__main__:42: RuntimeWarning: invalid value encountered in divide
Out[107]: 
array([[[ 2.5,  2. ,  nan],
        [ 3. ,  nan,  4. ]],

       [[ 5. ,  6. ,  7. ],
        [ 8. ,  9. ,  nan]]])

Alternative method

To avoid the warning, an alternative way would be like so -

# this reuses out_shp, ids and val from the function above
out = np.full(out_shp, np.nan)
sums = np.bincount(ids, val)                          # per-cell sums
unq_ids, count = np.unique(ids, return_counts=True)   # occupied cells and their counts
out.flat[unq_ids] = sums[unq_ids] / count             # write averages only where data exists
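
Putting the two pieces together, a self-contained variant might look like the following sketch (the name dataframe_to_array_averaged_nowarn is made up for illustration, not part of the original answer):

import numpy as np

def dataframe_to_array_averaged_nowarn(df):
    # integer grid positions, shifted so each axis starts at zero
    arr = df[['z', 'y', 'x']].values
    arr -= arr.min(0)
    out_shp = arr.max(0) + 1

    # one flat (linear) index per row
    val = df['value'].values
    ids = np.ravel_multi_index(arr.T, out_shp)

    # averages only for occupied cells, NaN everywhere else, no warning
    out = np.full(out_shp, np.nan)
    sums = np.bincount(ids, val)
    unq_ids, count = np.unique(ids, return_counts=True)
    out.flat[unq_ids] = sums[unq_ids] / count
    return out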

Another solution is to use the xarray package:

import pandas as pd
import xarray as xr

df = pd.DataFrame({'x': [1, 2, 1, 3, 1, 2, 3, 1, 2], 
                   'y': [1, 1, 2, 2, 1, 1, 1, 2, 2],
                   'z': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# pivot to a frame with an (x, y, z) MultiIndex, wrap it in a DataArray
# and unstack the MultiIndex into separate dimensions
df = pd.pivot_table(df, values='value', index=['x', 'y', 'z'])
xrTensor = xr.DataArray(df).unstack("dim_0")
array = xrTensor.values[0].T
print(array)

Output:

array([[[ 1.,  2., nan],
        [ 3., nan,  4.]],

       [[ 5.,  6.,  7.],
        [ 8.,  9., nan]]])

Note that the xrTensor object is very handy since xarray's DataArrays contain the labels, so you may just keep working with that object rather than pulling out the ndarray:

print(xrTensor)

Output:

<xarray.DataArray (dim_1: 1, x: 3, y: 2, z: 2)>
array([[[[ 1.,  5.],
         [ 3.,  8.]],

        [[ 2.,  6.],
         [nan,  9.]],

        [[nan,  7.],
         [ 4., nan]]]])
Coordinates:
  * dim_1    (dim_1) object 'value'
  * x        (x) int64 1 2 3
  * y        (y) int64 1 2
  * z        (z) int64 1 2
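
For example (a small illustration using the example data above), you can select by coordinate labels rather than by integer positions:

# label-based selection: the measurement at x=2, y=1, z=2
print(xrTensor.sel(x=2, y=1, z=2).values)   # [6.]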

We can use stack:

np.reshape(df.groupby(['z', 'y', 'x'])['value'].mean().unstack([1,2]).stack([0,1],dropna=False).values,(2,2,3))


Out[451]: 
array([[[  1.,   2.,  nan],
        [  3.,  nan,   4.]],
       [[  5.,   6.,   7.],
        [  8.,   9.,  nan]]])
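
If you would rather not hard-code the (2, 2, 3) shape, one possible tweak (a sketch reusing the same unstack/stack idea) is to derive the shape from the index levels:

s = df.groupby(['z', 'y', 'x'])['value'].mean()
shape = tuple(len(lvl) for lvl in s.index.levels)   # (2, 2, 3) for the example data
np.reshape(s.unstack([1, 2]).stack([0, 1], dropna=False).values, shape)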
