Expand a time series as numpy.array(), pandas.DataFrame(), or xarray.Dataset() so that missing records appear as NaN

Question

import numpy as np
import pandas as pd
import xarray as xr

validIdx = np.ones(365 * 5, dtype=bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()

In the above example, the time-series data ds (or df) has 30 randomly chosen missing records, and those records are simply absent rather than stored as NaNs. The length of the data is therefore 365x5 - 30, not 365x5.

My question is this: how can I expand ds and df so that the 30 missing values appear as NaNs (so that the length becomes 365x5)? For example, if the value for "2000-12-02" is missing from the example data, it will look like:

...
2000-12-01  value 1
2000-12-03  value 2
...

whereas what I want is:

...
2000-12-01  value 1
2000-12-02  NaN
2000-12-03  value 2
...

Answer 1

Perhaps you can try resample with a 1-hour frequency.

The df without NaNs (just after df = ds.to_dataframe()):

>>> df
                      foo
time
2000-01-01 00:00:00     0
2000-01-01 01:00:00     1
2000-01-01 02:00:00     2
2000-01-01 03:00:00     3
2000-01-01 04:00:00     4
...                   ...
2000-03-16 20:00:00  1820
2000-03-16 21:00:00  1821
2000-03-16 22:00:00  1822
2000-03-16 23:00:00  1823
2000-03-17 00:00:00  1824

[1795 rows x 1 columns]

The df with NaNs (df_1h):

>>> df_1h = df.resample('1H').mean()
>>> df_1h
                        foo
time
2000-01-01 00:00:00     0.0
2000-01-01 01:00:00     1.0
2000-01-01 02:00:00     2.0
2000-01-01 03:00:00     3.0
2000-01-01 04:00:00     4.0
...                     ...
2000-03-16 20:00:00  1820.0
2000-03-16 21:00:00  1821.0
2000-03-16 22:00:00  1822.0
2000-03-16 23:00:00  1823.0
2000-03-17 00:00:00  1824.0

[1825 rows x 1 columns]
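
As a possible alternative to resample(...).mean(), DataFrame.asfreq conforms the index to a fixed frequency without aggregating anything. The following is a minimal sketch on a small made-up frame (the names full, small, and small_1h and their values are illustrative, not taken from the question). It uses the lowercase alias "h", which newer pandas versions prefer over "H".

```python
import pandas as pd

# Hypothetical 5-hour series with the 02:00 record missing.
full = pd.date_range("2000-01-01", freq="h", periods=5, name="time")
small = pd.DataFrame({"foo": [0, 1, 3, 4]}, index=full.delete(2))

# asfreq inserts the missing timestamp and fills its value with NaN.
small_1h = small.asfreq("h")
print(small_1h)
```

As with resample, the integer column becomes float because NaN cannot be stored in a NumPy integer column.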

Rows with NaN:

>>> df_1h[df_1h['foo'].isna()]
                     foo
time
2000-01-02 10:00:00  NaN
2000-01-04 07:00:00  NaN
2000-01-05 06:00:00  NaN
2000-01-09 02:00:00  NaN
2000-01-13 15:00:00  NaN
2000-01-16 16:00:00  NaN
2000-01-18 21:00:00  NaN
2000-01-21 22:00:00  NaN
2000-01-23 19:00:00  NaN
2000-01-24 01:00:00  NaN
2000-01-24 19:00:00  NaN
2000-01-27 12:00:00  NaN
2000-01-27 16:00:00  NaN
2000-01-29 06:00:00  NaN
2000-02-02 01:00:00  NaN
2000-02-06 13:00:00  NaN
2000-02-09 11:00:00  NaN
2000-02-15 12:00:00  NaN
2000-02-15 15:00:00  NaN
2000-02-21 04:00:00  NaN
2000-02-28 05:00:00  NaN
2000-02-28 06:00:00  NaN
2000-03-01 15:00:00  NaN
2000-03-02 18:00:00  NaN
2000-03-04 18:00:00  NaN
2000-03-05 20:00:00  NaN
2000-03-12 08:00:00  NaN
2000-03-13 20:00:00  NaN
2000-03-16 01:00:00  NaN

The number of NaNs in df_1h:

>>> df_1h.isnull().sum()
foo    30
dtype: int64
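
One caveat: resample only spans the first to the last timestamp that is actually present, so a record missing at the very start or end would not be restored. A minimal sketch, assuming the intended range is known (here 365*5 hourly steps from 2000-01-01, as in the question), is to reindex both the DataFrame and the Dataset against an explicit full index. The seeded generator and the names full_time, valid, df_full, and ds_full are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

n = 365 * 5
# Complete hourly axis the data is expected to cover.
full_time = pd.date_range("2000-01-01", freq="h", periods=n, name="time")

# Rebuild the question's data; choice(replace=False) guarantees 30 distinct gaps.
rng = np.random.default_rng(0)
valid = np.ones(n, dtype=bool)
valid[rng.choice(n, size=30, replace=False)] = False
ds = xr.Dataset({"foo": ("time", np.arange(n)[valid]), "time": full_time[valid]})
df = ds.to_dataframe()

# Reindexing inserts the absent timestamps; their values become NaN.
df_full = df.reindex(full_time)
ds_full = ds.reindex(time=full_time)

print(len(df_full), int(df_full["foo"].isna().sum()))
print(ds_full.sizes["time"], int(ds_full["foo"].isnull().sum()))
```

From there, df_full["foo"].to_numpy() gives the NaN-padded values as a plain NumPy array.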
