Python Pandas time series resampling gives unexpected results

Question

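The data here is a bank account with a running balance. I want to resample the data to use only the end-of-day balance, i.e. the last value given for each day. A single day can have several data points, representing multiple transactions.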

In [1]: from StringIO import StringIO

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: print "Pandas version", pd.__version__
Pandas version 0.12.0

In [5]: print "Numpy version", np.__version__
Numpy version 1.7.1

In [6]: data_string = StringIO(""""Date","Balance"
   ...: "08/09/2013","1000"
   ...: "08/09/2013","950"
   ...: "08/09/2013","930"
   ...: "08/06/2013","910"
   ...: "08/02/2013","900"
   ...: "08/01/2013","88"
   ...: "08/01/2013","87"
   ...: """)

In [7]: ts = pd.read_csv(data_string, parse_dates=[0], index_col=0)

In [8]: print ts
            Balance
Date               
2013-08-09     1000
2013-08-09      950
2013-08-09      930
2013-08-06      910
2013-08-02      900
2013-08-01       88
2013-08-01       87

I would expect "2013-08-09" to be 1000, and definitely not the "middle" number 950.

In [10]: ts.Balance.resample('D', how='last')
Out[10]: 
Date
2013-08-01     88
2013-08-02    900
2013-08-03    NaN
2013-08-04    NaN
2013-08-05    NaN
2013-08-06    910
2013-08-07    NaN
2013-08-08    NaN
2013-08-09    950
Freq: D, dtype: float64

I would expect "2013-08-09" to be 930, or "2013-08-01" to be 88.

In [12]: ts.Balance.resample('D', how='first')
Out[12]: 
Date
2013-08-01      87
2013-08-02     900
2013-08-03     NaN
2013-08-04     NaN
2013-08-05     NaN
2013-08-06     910
2013-08-07     NaN
2013-08-08     NaN
2013-08-09    1000
Freq: D, dtype: float64

Am I missing something here? Does resampling with "first" and "last" not work the way I expect?

Answer 1

To be able to resample your data, Pandas first has to sort it. So if you load the data and sort it by index, you get the following:

>>> pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index()
            Balance
Date               
2013-08-01       87
2013-08-01       88
2013-08-02      900
2013-08-06      910
2013-08-09     1000
2013-08-09      930
2013-08-09      950

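This explains the results you got. @Jeff explains why the order is "arbitrary", and, per your comment, the solution is to sort the data with the mergesort algorithm before the operation. Mergesort is a stable sort, so rows that share the same date keep their relative order...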

>>> df = pd.read_csv(data_string, parse_dates=[0],
                     index_col=0).sort_index(kind='mergesort')
>>> df.Balance.resample('D',how='last')
2013-08-01      88
2013-08-02     900
2013-08-03     NaN
2013-08-04     NaN
2013-08-05     NaN
2013-08-06     910
2013-08-07     NaN
2013-08-08     NaN
2013-08-09    1000
>>> df.Balance.resample('D', how='first')
2013-08-01     87
2013-08-02    900
2013-08-03    NaN
2013-08-04    NaN
2013-08-05    NaN
2013-08-06    910
2013-08-07    NaN
2013-08-08    NaN
2013-08-09    930
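
For readers on a current pandas release, here is a minimal sketch of the same idea. The `how=` argument of `resample` is no longer available, so the aggregation is called as a method on the resampler instead. The sketch assumes the file lists the newest transaction first (an assumption based on the question's expected values), so the rows are reversed before the stable sort. The names `balance`, `chronological`, `end_of_day` and `start_of_day` are only illustrative.

```python
import pandas as pd

# Same sample data as in the question, in file order (assumed newest first).
dates = pd.to_datetime([
    "2013-08-09", "2013-08-09", "2013-08-09",
    "2013-08-06", "2013-08-02", "2013-08-01", "2013-08-01",
])
balance = pd.Series([1000, 950, 930, 910, 900, 88, 87],
                    index=dates, name="Balance")

# Reverse to oldest-first, then sort with a stable algorithm so that
# rows sharing a date keep their relative order.
chronological = balance.iloc[::-1].sort_index(kind="mergesort")

# Aggregate per calendar day; days without transactions become NaN.
end_of_day = chronological.resample("D").last()
start_of_day = chronological.resample("D").first()

print(end_of_day)
```

Under that ordering assumption, `end_of_day` holds 88 for 2013-08-01 and 1000 for 2013-08-09. If your file is already oldest-first, drop the `iloc[::-1]` step.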

Answer 2

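The problem is that your dates contain duplicates, so rows that share a date can in effect end up in an arbitrary order; a stable ordering is not guaranteed.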

In [24]: ts.Balance.resample('D',how='last')
Out[24]: 
Date
2013-08-01     87
2013-08-02    900
2013-08-03    NaN
2013-08-04    NaN
2013-08-05    NaN
2013-08-06    910
2013-08-07    NaN
2013-08-08    NaN
2013-08-09    930
Freq: D, dtype: float64

In [25]: ts.Balance.order().resample('D',how='last')
Out[25]: 
Date
2013-08-01      88
2013-08-02     900
2013-08-03     NaN
2013-08-04     NaN
2013-08-05     NaN
2013-08-06     910
2013-08-07     NaN
2013-08-08     NaN
2013-08-09    1000
Freq: D, dtype: float64

The simplest approach is to sort the data, but it is not clear what the actual order is (e.g. you need an external parameter to determine it).
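
As an illustration of such an external parameter, here is a rough sketch that adds an explicit tie-breaker column and sorts on it. The `seq` column is hypothetical: it is derived from the row position under the assumption that the file lists the newest transaction first. In real data a transaction ID or a full timestamp would be a better key, if one exists.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-08-09", "2013-08-09", "2013-08-09",
        "2013-08-06", "2013-08-02", "2013-08-01", "2013-08-01",
    ]),
    "Balance": [1000, 950, 930, 910, 900, 88, 87],
})

# Hypothetical tie-breaker: the file is assumed to be newest-first,
# so rows nearer the top get a larger sequence number.
df["seq"] = range(len(df), 0, -1)

# Sorting on (Date, seq) makes the order within each day explicit.
ordered = df.sort_values(["Date", "seq"])

# Last balance of each calendar day; empty days become NaN.
end_of_day = ordered.set_index("Date")["Balance"].resample("D").last()

print(end_of_day)
```

Because the key `(Date, seq)` is unique per row, the result no longer depends on which sorting algorithm is used.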

You can pass sort=False to groupby (but you cannot do this with resample):

In [29]: ts.groupby(ts.index,sort=False).last().reindex(pd.date_range(ts.index.min(),ts.index.max()))
Out[29]: 
            Balance
2013-08-01       87
2013-08-02      900
2013-08-03      NaN
2013-08-04      NaN
2013-08-05      NaN
2013-08-06      910
2013-08-07      NaN
2013-08-08      NaN
2013-08-09      930
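
The groupby approach can be sketched on a current pandas release as follows. Within each group, `first()` and `last()` follow the row order of the original frame, so no sorting of the rows is involved. This sketch assumes the file is newest-first, which is why it uses `first()` to pick the end-of-day balance; with an oldest-first file you would use `last()` instead. The names `ts_demo` and `daily` are only illustrative.

```python
import pandas as pd

dates = pd.to_datetime([
    "2013-08-09", "2013-08-09", "2013-08-09",
    "2013-08-06", "2013-08-02", "2013-08-01", "2013-08-01",
])
ts_demo = pd.DataFrame({"Balance": [1000, 950, 930, 910, 900, 88, 87]},
                       index=dates)

# Group by the date index without sorting the group keys; first() takes
# the first row of each date as it appears in the file.
per_day = ts_demo.groupby(level=0, sort=False).first()

# Expand to a full daily calendar; days without transactions become NaN.
full_range = pd.date_range(ts_demo.index.min(), ts_demo.index.max())
daily = per_day.reindex(full_range)

print(daily)
```

Here `daily` has one row per calendar day from 2013-08-01 to 2013-08-09, with 88 on the first day and 1000 on the last.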

You can get what you want this way:

In [52]: df = pd.DataFrame(ts.values,index=ts.index,columns=['values']).reset_index()

In [53]: df
Out[53]: 
                 Date  values
0 2013-08-09 00:00:00    1000
1 2013-08-09 00:00:00     950
2 2013-08-09 00:00:00     930
3 2013-08-06 00:00:00     910
4 2013-08-02 00:00:00     900
5 2013-08-01 00:00:00      88
6 2013-08-01 00:00:00      87

In [54]: df.groupby('Date').apply(lambda x: x.iloc[-1]['values']).reindex(pd.date_range(ts.index.min(),ts.index.max()))

Out[54]: 
2013-08-01     87
2013-08-02    900
2013-08-03    NaN
2013-08-04    NaN
2013-08-05    NaN
2013-08-06    910
2013-08-07    NaN
2013-08-08    NaN
2013-08-09    930
Freq: D, dtype: float64
