
How to split a pandas time-series by NaN values

I have a pandas TimeSeries that looks like this:

2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875
2007-02-06 18:00:00      NaN
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780
2007-02-06 23:00:00      NaN
2007-02-07 00:00:00      NaN
2007-02-07 01:00:00    0.780
2007-02-07 02:00:00    0.580
2007-02-07 03:00:00    0.880
2007-02-07 04:00:00    0.791
2007-02-07 05:00:00      NaN   

I would like to split the pandas TimeSeries every time one or more NaN values occur. The goal is to end up with the events separated:

Event1:
2007-02-06 15:00:00    0.780
2007-02-06 16:00:00    0.125
2007-02-06 17:00:00    0.875

Event2:
2007-02-06 19:00:00    0.565
2007-02-06 20:00:00    0.875
2007-02-06 21:00:00    0.910
2007-02-06 22:00:00    0.780

I could loop through every row, but is there also a smarter way to do this?

You can use numpy.split and then filter the resulting list. Here is an example, assuming the column holding the values is labeled "value":

events = np.split(df, np.where(np.isnan(df.value))[0])
# removing NaN entries
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]
# removing empty DataFrames
events = [ev for ev in events if not ev.empty]

You will end up with a list containing all the events, separated at the NaN values.
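As an alternative sketch that stays within pandas (this is not the numpy.split approach above, but a different, commonly used idiom), each run of valid values can be labeled with a running count of the NaN values seen so far and then grouped on that label. The sample series below is rebuilt from the data in the question; the names `ts`, `run_id`, and `events` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question: hourly values with NaN gaps
idx = pd.Timestamp("2007-02-06 15:00") + pd.to_timedelta(np.arange(15), unit="h")
values = [0.780, 0.125, 0.875, np.nan,
          0.565, 0.875, 0.910, 0.780, np.nan, np.nan,
          0.780, 0.580, 0.880, 0.791, np.nan]
ts = pd.Series(values, index=idx)

is_nan = ts.isna()
# Each NaN bumps the counter, so all rows between two gaps share one id
run_id = is_nan.cumsum()

# Drop the NaN rows, then group the remaining rows by their run id
valid = ts[~is_nan]
events = [event for _, event in valid.groupby(run_id[~is_nan])]

for number, event in enumerate(events, start=1):
    print(f"Event{number}:")
    print(event, end="\n\n")
```

On the sample data this yields three events with 3, 4, and 4 rows. Consecutive NaN values do not produce empty events, because the NaN rows are dropped before grouping.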

Note that this answer is for pandas<0.25.0; if you are using 0.25.0 or later, see this answer by thesofakillers.


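I found an efficient solution for very large and sparse datasets. In my case, I had hundreds of thousands of rows with only a dozen or so short segments of data between NaN values. I (ab)used the internals of pandas.SparseIndex, a feature that helps compress sparse datasets in memory.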

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)

# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

Which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
                       ..
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

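# Convert to sparse then query index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)

# Map the sparse blocks back to the dense timeseries
# (iloc's stop is exclusive, so start + length covers the whole block)
blocks = [dense_ts.iloc[start:(start + length)] for (start, length) in block_locs]

Voilà (the printout was captured with the slice ending at start + length - 1, so each block is shown without its final sample):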

[2011-01-01 00:08:20    0.531793
 2011-01-01 00:08:21    0.484391
 2011-01-01 00:08:22    0.022686
 2011-01-01 00:08:23   -0.206495
 2011-01-01 00:08:24    1.472209
 2011-01-01 00:08:25   -1.261940
 2011-01-01 00:08:26   -0.696388
 2011-01-01 00:08:27   -0.219316
 2011-01-01 00:08:28   -0.474840
 Freq: S, dtype: float64, 2011-01-01 03:20:00   -0.147190
 2011-01-01 03:20:01    0.299565
 2011-01-01 03:20:02   -0.846878
 2011-01-01 03:20:03   -0.100975
 2011-01-01 03:20:04    1.288872
 2011-01-01 03:20:05   -0.092474
 2011-01-01 03:20:06   -0.214774
 2011-01-01 03:20:07   -0.540479
 2011-01-01 03:20:08   -0.661083
 2011-01-01 03:20:09    1.129878
 2011-01-01 03:20:10    0.791373
 2011-01-01 03:20:11    0.119564
 2011-01-01 03:20:12    0.345459
 2011-01-01 03:20:13   -0.272132
 Freq: S, dtype: float64, 2011-01-01 05:33:20    1.028268
 2011-01-01 05:33:21    1.476468
 2011-01-01 05:33:22    1.308881
 2011-01-01 05:33:23    1.458202
 2011-01-01 05:33:24   -0.874308
                              ..
 2011-01-01 05:34:02    0.941446
 2011-01-01 05:34:03   -0.996767
 2011-01-01 05:34:04    1.266660
 2011-01-01 05:34:05   -0.391560
 2011-01-01 05:34:06    1.498499
 2011-01-01 05:34:07   -0.636908
 2011-01-01 05:34:08    0.621681
 Freq: S, dtype: float64]
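To make the idea of block locations and block lengths concrete without relying on pandas' sparse internals, here is a minimal NumPy sketch that finds the same kind of (start, length) pairs by looking for edges in the not-NaN mask. The helper name `find_blocks` and the small test series are assumptions made for illustration; this is not how pandas computes its sparse index.

```python
import numpy as np
import pandas as pd


def find_blocks(series):
    """Return (start, length) pairs for each run of consecutive non-NaN values."""
    valid = series.notna().to_numpy()
    # Pad with False on both sides so every run has a rising and a falling edge
    padded = np.concatenate(([False], valid, [False]))
    # Positions where the mask flips: even entries are starts, odd are exclusive stops
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]
    return list(zip(starts.tolist(), (stops - starts).tolist()))


# Small stand-in for the dense series: 30 seconds of NaN with two data blocks
rng = pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(30), unit="s")
ts = pd.Series(np.nan, index=rng, dtype=np.float64)
ts.iloc[5:10] = 1.0
ts.iloc[20:23] = 2.0

block_locs = find_blocks(ts)
# iloc's stop is exclusive, so start + length covers the whole block
blocks = [ts.iloc[start:start + length] for start, length in block_locs]
print(block_locs)
```

Because the slice stop in iloc is exclusive, slicing with `start + length` returns every sample of a block, including the last one.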

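For anyone looking for a non-deprecated (pandas>=0.25.0) version of bloudermilk's answer: after a bit of digging in the pandas sparse source code, I came up with the following. I tried to keep it as similar as possible to their answer so you can compare: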

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')

# NaN data interspersed with 3 blocks of non-NaN data
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

Which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
2011-01-01 00:00:04   NaN
                       ..
2011-01-10 23:59:55   NaN
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

# Convert to sparse then query index to find block locations
# different way of converting to sparse in pandas>=0.25.0
sparse_ts = dense_ts.astype(pd.SparseDtype('float'))
# we need to use .values.sp_index.to_block_index() in this version of pandas
block_index = sparse_ts.values.sp_index.to_block_index()
block_locs = zip(block_index.blocs, block_index.blengths)
# Map the sparse blocks back to the dense timeseries
# (iloc's stop is exclusive, so start + length covers the whole block)
blocks = [
    dense_ts.iloc[start : (start + length)]
    for (start, length) in block_locs
]

# printout captured with the slice ending at start + length - 1,
# so each block is shown without its final sample
> blocks
[2011-01-01 00:08:20    0.092338
 2011-01-01 00:08:21    1.196703
 2011-01-01 00:08:22    0.936586
 2011-01-01 00:08:23   -0.354768
 2011-01-01 00:08:24   -0.209642
 2011-01-01 00:08:25   -0.750103
 2011-01-01 00:08:26    1.344343
 2011-01-01 00:08:27    1.446148
 2011-01-01 00:08:28    1.174443
 Freq: S, dtype: float64,
 2011-01-01 03:20:00    1.327026
 2011-01-01 03:20:01   -0.431162
 2011-01-01 03:20:02   -0.461407
 2011-01-01 03:20:03   -1.330671
 2011-01-01 03:20:04   -0.892480
 2011-01-01 03:20:05   -0.323433
 2011-01-01 03:20:06    2.520965
 2011-01-01 03:20:07    0.140757
 2011-01-01 03:20:08   -1.688278
 2011-01-01 03:20:09    0.856346
 2011-01-01 03:20:10    0.013968
 2011-01-01 03:20:11    0.204514
 2011-01-01 03:20:12    0.287756
 2011-01-01 03:20:13   -0.727400
 Freq: S, dtype: float64,
 2011-01-01 05:33:20   -1.409744
 2011-01-01 05:33:21    0.338251
 2011-01-01 05:33:22    0.215555
 2011-01-01 05:33:23   -0.309874
 2011-01-01 05:33:24    0.753737
 2011-01-01 05:33:25   -0.349966
 2011-01-01 05:33:26    0.074758
 2011-01-01 05:33:27   -1.574485
 2011-01-01 05:33:28    0.595844
 2011-01-01 05:33:29   -0.670004
 2011-01-01 05:33:30    1.655479
 2011-01-01 05:33:31   -0.362853
 2011-01-01 05:33:32    0.167355
 2011-01-01 05:33:33    0.703780
 2011-01-01 05:33:34    2.633756
 2011-01-01 05:33:35    1.898891
 2011-01-01 05:33:36   -1.129365
 2011-01-01 05:33:37   -0.765057
 2011-01-01 05:33:38    0.279869
 2011-01-01 05:33:39    1.388705
 2011-01-01 05:33:40   -1.420761
 2011-01-01 05:33:41    0.455692
 2011-01-01 05:33:42    0.367106
 2011-01-01 05:33:43    0.856598
 2011-01-01 05:33:44    1.920748
 2011-01-01 05:33:45    0.648581
 2011-01-01 05:33:46   -0.606784
 2011-01-01 05:33:47   -0.246285
 2011-01-01 05:33:48   -0.040520
 2011-01-01 05:33:49    1.422764
 2011-01-01 05:33:50   -1.686568
 2011-01-01 05:33:51    1.282430
 2011-01-01 05:33:52    1.358482
 2011-01-01 05:33:53   -0.998765
 2011-01-01 05:33:54   -0.009527
 2011-01-01 05:33:55    0.647671
 2011-01-01 05:33:56   -1.098435
 2011-01-01 05:33:57   -0.638245
 2011-01-01 05:33:58   -1.820668
 2011-01-01 05:33:59    0.768250
 2011-01-01 05:34:00   -1.029975
 2011-01-01 05:34:01   -0.744205
 2011-01-01 05:34:02    1.627130
 2011-01-01 05:34:03    2.058689
 2011-01-01 05:34:04   -1.194971
 2011-01-01 05:34:05    1.293214
 2011-01-01 05:34:06    0.029523
 2011-01-01 05:34:07   -0.405785
 2011-01-01 05:34:08    0.837123
 Freq: S, dtype: float64]

