
When to use multiindexing vs. xarray in pandas

The pandas pivot tables documentation seems to recommend dealing with more than two dimensions of data by using multiindexing:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:         N, K = frame.shape
   ...:         data = {'value' : frame.values.ravel('F'),
   ...:                 'variable' : np.asarray(frame.columns).repeat(N),
   ...:                 'date' : np.tile(np.asarray(frame.index), K)}
   ...:         return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...: 

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df['value2'] = df['value'] * 2

In [7]: df
Out[7]: 
         date variable     value    value2
0  2000-01-03        A  0.462461  0.924921
1  2000-01-04        A -0.517911 -1.035823
2  2000-01-05        A  0.831014  1.662027
3  2000-01-03        B -0.492679 -0.985358
4  2000-01-04        B -1.234068 -2.468135
5  2000-01-05        B  1.725218  3.450437
6  2000-01-03        C  0.453859  0.907718
7  2000-01-04        C -0.763706 -1.527412
8  2000-01-05        C  0.839706  1.679413
9  2000-01-03        D -0.048108 -0.096216
10 2000-01-04        D  0.184461  0.368922
11 2000-01-05        D -0.349496 -0.698993

In [8]: df.pivot(index='date', columns='variable')
Out[8]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463   
2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191   
2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035   


variable           C         D  
date                            
2000-01-03 -0.469259 -2.504964  
2000-01-04  0.940506 -2.362012  
2000-01-05 -5.250171  2.550861  

I thought that xarray was made for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop(columns='variable')) for var in np.unique(df.variable)]))
Out[10]: 
<xarray.DataArray ()>
array({'A':         date     value    value2
0 2000-01-03  0.462461  0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
6 2000-01-03  0.453859  0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
9  2000-01-03 -0.048108 -0.096216
10 2000-01-04  0.184461  0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multiindexing?

There does seem to be a transition to xarray for work on multi-dimensional arrays. Pandas is deprecating support for the 3D Panel data structure, and its documentation even suggests using xarray for working with multidimensional arrays:

'Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel's main use cases. Here is a link to the xarray panel-transition documentation.'
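As a rough illustration of moving between the two representations, here is a minimal sketch that converts a MultiIndex DataFrame to an xarray Dataset and back. The data and the names `date`, `site`, `value`, and `value2` are made up for illustration; `DataFrame.to_xarray()` needs xarray to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: 3 dates x 2 sites (names are illustrative only).
dates = pd.date_range("2000-01-03", periods=3, freq="D")
sites = ["A", "B"]
index = pd.MultiIndex.from_product([dates, sites], names=["date", "site"])

df = pd.DataFrame({"value": np.arange(6, dtype=float)}, index=index)
df["value2"] = df["value"] * 2

# Each index level becomes a dimension; each column becomes a data variable.
ds = df.to_xarray()

# Going back yields a DataFrame indexed by a (date, site) MultiIndex again.
roundtrip = ds.to_dataframe()
```

In this sketch the two index levels become the Dataset's dimensions, so the same data can be handled with whichever library is more convenient for a given step.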

The xarray documentation states the project's aims and goals:

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data...

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.

The main advantage of xarray over plain NumPy is that it uses labels across multiple dimensions in the same way pandas does. If you are working with 3-dimensional data, multi-indexing and xarray may be interchangeable. As the number of dimensions in your data set grows, xarray becomes much more manageable. I cannot comment on how each performs in terms of efficiency or speed.
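To make the comparison concrete, here is a small sketch of the same label-based selection and reduction done both ways on a 3-dimensional array. The dimension names `date`, `site`, and `depth` and the values are hypothetical, chosen only so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

dates = pd.date_range("2000-01-03", periods=3, freq="D")

# Hypothetical 3-D labelled array with dims (date, site, depth).
arr = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 2, 2),
    coords={"date": dates, "site": ["A", "B"], "depth": [0, 10]},
    dims=["date", "site", "depth"],
    name="value",
)

# xarray: select by label on named dimensions, reduce over a named dimension.
a_xr = arr.sel(site="A", depth=10)   # 1-D, indexed by date
mean_xr = arr.mean(dim="date")       # 2-D, indexed by (site, depth)

# pandas: the same data as a Series with a 3-level MultiIndex.
s = arr.to_series()
a_pd = s.xs(("A", 10), level=["site", "depth"])       # indexed by date
mean_pd = s.groupby(level=["site", "depth"]).mean()   # indexed by (site, depth)
```

Both routes give the same numbers here. The difference is mostly ergonomic: xarray addresses each dimension by name, while the MultiIndex version has to name the levels to select or to keep in every operation, which gets more verbose as dimensions are added.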
