
How to efficiently resample a Pandas dataframe into a 3d NumPy array?

I have a big data frame with a DatetimeIndex and multiple columns. Now I would like to have an operation resample_3d which can be used like this:

index, array = df.resample_3d("1h", fill_value=0)

... and transforms the data frame

index | A | B | C | D
10:00 | 1 |   |   |
10:01 | 1 |   |   |
12:00 | 1 |   |   |
13:00 | 1 |   |   |

into a 3d NumPy array of shape (3, 2, 4). The first dimension is the time (which can be looked up in the separately returned index), the second dimension is the row index within the "resample group", and the third dimension is the features. The size of the second dimension equals the maximum number of rows in a single resample group. Unused entries are filled (e.g. with zeros).

Is there such a function, or a similar one, in Pandas or another library? Or is there a way to implement something like this efficiently on top of Pandas without too much work?

I am aware that I could build something on top of df.resample().apply(list), but this is way too slow for bigger data frames.

I have already started my own implementation with Numba, but then quickly realized that this is quite a lot of work.

(I have just discovered xarray and thought I would tag this question with it, because it may be a better base for doing this than Pandas.)

It is unclear what your data looks like, but yes, xarray might be what you are looking for.
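If the data starts out as a pandas data frame, one possible way to get a DataArray is to pass the frame to the constructor and name the two dimensions. This is only a sketch: the toy frame below and the dimension names "time" and "features" are assumptions chosen to match the example further down, not something taken from the question's data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy frame standing in for the real data (values are made up).
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.date_range("2020-04-07", periods=4, freq="30min"),
    columns=list("ABC"),
)

# The index and the columns become the coordinates of the two named dimensions.
da = xr.DataArray(df, dims=("time", "features"))
```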

Once your data is well-formatted as a DataArray, you can then just do:

da.resample(time="1h")

It will return a DataArrayResample object.

Usually, when resampling, the new coordinate grid doesn't match the previous grid.

Thus, from there, you need to apply one of the numerous methods of the DataArrayResample object to tell xarray how to fill this new grid.

For example, you may want to interpolate values using the original data as knots:

da.resample(time="1h").interpolate("linear")

But you can also backfill, pad, use the nearest values etc.

If you don't want to fill the new grid, use .asfreq() and the values at the new times will be set to NaN. You'll still be able to interpolate later using interpolate_na().
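A small sketch of that two-step route, assuming a one-dimensional hourly series with made-up values: .asfreq() leaves the new half-hour steps as NaN, and interpolate_na() fills them afterwards.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hourly toy data (values are made up for illustration).
da = xr.DataArray(
    np.array([0.0, 2.0, 4.0]),
    dims="time",
    coords={"time": pd.date_range("2020-04-07", periods=3, freq="1h")},
)

# Upsample to a 30-minute grid; the new time steps hold NaN.
sparse = da.resample(time="30min").asfreq()

# Fill the gaps later by linear interpolation along the time dimension.
filled = sparse.interpolate_na(dim="time", method="linear")
```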

Your case

In your case, it seems that you are down-sampling, so every new grid coordinate coincides with one of the original grid coordinates.

So, the methods that will work for you are any of .nearest(), .asfreq() or .interpolate() (note that .interpolate() will convert int to float).

However, since you are down-sampling at exact grid knots, what you are really doing is selecting a subset of your array, so you might want to use the .sel() method instead.

Example

An example of down-sampling on exact grid points.

Create the data:

>>> import string
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr

>>> dims = ("time", "features")
>>> sizes = (6, 3)
>>> h_step = 0.5

>>> da = xr.DataArray(
        dims=dims,
        data=np.arange(np.prod(sizes)).reshape(*sizes),
        coords=dict(
            time=pd.date_range(
                "04/07/2020",
                periods=sizes[0],
                freq=pd.DateOffset(hours=h_step),
            ),
            features=list(string.ascii_uppercase[: sizes[1]]),
        ),
    )

>>> da
<xarray.DataArray (time: 6, features: 3)>
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:30:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T00:30:00.000000000',
       '2020-04-07T01:00:00.000000000', 
       '2020-04-07T01:30:00.000000000',
       '2020-04-07T02:00:00.000000000',
       '2020-04-07T02:30:00.000000000'],
      dtype='datetime64[ns]')

Down-sampling using .resample() and .nearest():

>>> da.resample(time="1h").nearest()
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.resample(time="1h").nearest().time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')

Down-sampling by selection:

>>> dwn_step = 2

>>> new_time = pd.date_range(
        "04/07/2020",
        periods=sizes[0] // dwn_step,
        freq=pd.DateOffset(hours=h_step * dwn_step),
    )

>>> da.sel(time=new_time)
<xarray.DataArray (time: 3, features: 3)>
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
Coordinates:
  * time      (time) datetime64[ns] 2020-04-07 ... 2020-04-07T02:00:00
  * features  (features) <U1 'A' 'B' 'C'

>>> da.sel(time=new_time).time.values
array(['2020-04-07T00:00:00.000000000',
       '2020-04-07T01:00:00.000000000',
       '2020-04-07T02:00:00.000000000'],
      dtype='datetime64[ns]')

Another option to create the new_time index is simply:

new_time = da.time[::dwn_step]

It is more straightforward, but you can't choose the first selected time (which can be either good or bad, depending on your case).
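Coming back to the padded layout asked for in the question, here is a minimal sketch in plain pandas and NumPy. It is not an existing pandas or xarray function: the name resample_3d is taken from the question, while the implementation, the choice to skip empty bins, and the float output dtype are assumptions. It also assumes a non-empty frame with a DatetimeIndex and numeric columns.

```python
import numpy as np
import pandas as pd


def resample_3d(df, freq, fill_value=0):
    """Hypothetical helper: pack rows into a (groups, max_rows, features) array."""
    # Label every row with the start of its resample bin.
    bins = df.index.floor(freq)
    # Integer group id per row, plus the sorted unique bin labels.
    codes, index = pd.factorize(bins, sort=True)
    # Position of each row inside its own bin (0, 1, 2, ...).
    pos = pd.Series(codes).groupby(codes).cumcount().to_numpy()
    # Pre-fill the output, then scatter all rows in one vectorized assignment.
    shape = (len(index), pos.max() + 1, df.shape[1])
    out = np.full(shape, fill_value, dtype=float)
    out[codes, pos] = df.to_numpy()
    return index, out


# Usage with the four timestamps from the question (cell values are made up).
times = pd.to_datetime(
    ["2020-01-01 10:00", "2020-01-01 10:01", "2020-01-01 12:00", "2020-01-01 13:00"]
)
frame = pd.DataFrame(np.arange(1, 17).reshape(4, 4), index=times, columns=list("ABCD"))
index, array = resample_3d(frame, "1h", fill_value=0)
```

With these four rows the result has shape (3, 2, 4), because the empty 11:00 bin is not materialized. If empty bins should appear as all-fill slices, the bin labels would have to be reindexed against a full date range first.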
