Expand a time series in the form of numpy.array(), pandas.DataFrame(), or xarray.Dataset() to contain the missing records as NaN

Question

import numpy as np
import pandas as pd
import xarray as xr

validIdx = np.ones(365 * 5, dtype=bool)
# Draw 30 distinct positions so that exactly 30 records are dropped
validIdx[np.random.choice(365 * 5, size=30, replace=False)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()

In the above example, the time-series data ds (or df) has 30 randomly chosen records missing, and the gaps are not represented as NaNs. Therefore, the length of data is 365x5 - 30, not 365x5.

My question is this: how can I expand ds and df so that the 30 missing values appear as NaNs (so that the length will be 365x5)? For example, if the value for "2000-12-02" is missing in the example data, then it will look like:

...
2000-12-01  value 1
2000-12-03  value 2
...

while what I want is:

...
2000-12-01  value 1
2000-12-02  NaN
2000-12-03  value 2
...

Answer 1

Perhaps you can try resample with a 1-hour frequency. Hours with no record become empty bins, and the mean of an empty bin is NaN.

The df without NaNs (just after df = ds.to_dataframe() ):

>>> df
                      foo
time
2000-01-01 00:00:00     0
2000-01-01 01:00:00     1
2000-01-01 02:00:00     2
2000-01-01 03:00:00     3
2000-01-01 04:00:00     4
...                   ...
2000-03-16 20:00:00  1820
2000-03-16 21:00:00  1821
2000-03-16 22:00:00  1822
2000-03-16 23:00:00  1823
2000-03-17 00:00:00  1824

[1795 rows x 1 columns]

The df with NaNs ( df_1h ):

>>> df_1h = df.resample('1H').mean()
>>> df_1h
                        foo
time
2000-01-01 00:00:00     0.0
2000-01-01 01:00:00     1.0
2000-01-01 02:00:00     2.0
2000-01-01 03:00:00     3.0
2000-01-01 04:00:00     4.0
...                     ...
2000-03-16 20:00:00  1820.0
2000-03-16 21:00:00  1821.0
2000-03-16 22:00:00  1822.0
2000-03-16 23:00:00  1823.0
2000-03-17 00:00:00  1824.0

[1825 rows x 1 columns]
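
As a possible alternative to resample, the frame can be reindexed onto an explicitly built hourly index. The sketch below uses a small hypothetical frame (df_small, full_index and keep are illustrative names, not from the question) rather than the data above. One reason to consider it: resample only spans the first to the last observed timestamp, so a gap at either end of the series would not be restored, whereas an explicit index covers the whole intended range.

```python
import numpy as np
import pandas as pd

# Hypothetical small example: six hourly stamps, two of which are dropped.
full_index = pd.date_range("2000-01-01", periods=6, freq="h", name="time")
keep = np.array([True, True, False, True, False, True])
df_small = pd.DataFrame({"foo": np.arange(6)[keep]}, index=full_index[keep])

# Reindex onto the complete hourly index; stamps absent from df_small
# get NaN, and the integer column is promoted to float to hold them.
df_full = df_small.reindex(full_index)

print(df_full)
print(int(df_full["foo"].isna().sum()))  # number of restored gaps
```

For the data in the question, the full index would presumably be the same pd.date_range call used to build time, without the [validIdx] mask.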

Rows with NaN:

>>> df_1h[df_1h['foo'].isna()]
                     foo
time
2000-01-02 10:00:00  NaN
2000-01-04 07:00:00  NaN
2000-01-05 06:00:00  NaN
2000-01-09 02:00:00  NaN
2000-01-13 15:00:00  NaN
2000-01-16 16:00:00  NaN
2000-01-18 21:00:00  NaN
2000-01-21 22:00:00  NaN
2000-01-23 19:00:00  NaN
2000-01-24 01:00:00  NaN
2000-01-24 19:00:00  NaN
2000-01-27 12:00:00  NaN
2000-01-27 16:00:00  NaN
2000-01-29 06:00:00  NaN
2000-02-02 01:00:00  NaN
2000-02-06 13:00:00  NaN
2000-02-09 11:00:00  NaN
2000-02-15 12:00:00  NaN
2000-02-15 15:00:00  NaN
2000-02-21 04:00:00  NaN
2000-02-28 05:00:00  NaN
2000-02-28 06:00:00  NaN
2000-03-01 15:00:00  NaN
2000-03-02 18:00:00  NaN
2000-03-04 18:00:00  NaN
2000-03-05 20:00:00  NaN
2000-03-12 08:00:00  NaN
2000-03-13 20:00:00  NaN
2000-03-16 01:00:00  NaN
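
For plain NumPy arrays there is no index to resample, but the same expansion can be sketched by converting each timestamp into its position on a complete hourly axis. The arrays below are hypothetical stand-ins for the question's time and data, and the approach assumes the timestamps fall exactly on the hourly grid.

```python
import numpy as np

# Hypothetical inputs: hourly timestamps with gaps, one value per stamp.
time = np.array(
    ["2000-01-01T00", "2000-01-01T01", "2000-01-01T03", "2000-01-01T05"],
    dtype="datetime64[h]",
)
data = np.array([0, 1, 3, 5])

# Complete hourly axis from the first to the last observed stamp (inclusive).
step = np.timedelta64(1, "h")
full_time = np.arange(time[0], time[-1] + step, step)

# Position of each observed stamp on the complete axis.
pos = ((time - time[0]) / step).astype(int)

# NaN requires a float array; observed values are written into their slots,
# and every slot that is never written stays NaN.
full_data = np.full(full_time.shape, np.nan)
full_data[pos] = data

print(full_time)
print(full_data)
```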

The number of NaNs in df_1h :

>>> df_1h.isnull().sum()
foo    30
dtype: int64
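
The question also asks about the xarray Dataset, which the steps above do not cover. A minimal sketch, assuming the complete time axis is known, is to reindex the dataset along "time". The names ds_small, full_time and keep are hypothetical and the data is a small stand-in for the question's ds.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small dataset: six hourly stamps, two of which are dropped.
full_time = pd.date_range("2000-01-01", periods=6, freq="h")
keep = np.array([True, True, False, True, False, True])
ds_small = xr.Dataset(
    {"foo": ("time", np.arange(6)[keep]), "time": full_time[keep]}
)

# Reindex along "time"; labels absent from ds_small are filled with NaN,
# which promotes the integer variable to float.
ds_full = ds_small.reindex(time=full_time)

print(ds_full["foo"].values)
print(int(ds_full["foo"].isnull().sum()))  # number of restored gaps
```

Converting the result with ds_full.to_dataframe() should then give a frame that already contains the NaN rows.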

solution1
1 ACCEPTED 2021-04-21 08:08:15