import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365*5, dtype= bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time-series data ds
(or df
) has 30 randomly chosen missing records without having those as NaNs. Therefore, the length of data is 365x5 - 30, not 365x5).
My question is this: how can I expand the ds
and df
to have the 30 missing values as NaNs (so, the length will be 365x5)? For example, if a value in "2000-12-02" is missed in the example data, then it will look like:
...
2000-12-01 value 1
2000-12-03 value 2
...
, while what I want to have is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resample
with 1 hour.
The df
without NaNs (just after df = ds.to_dataframe()
):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df
with NaNs ( df_1h
):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h
:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.