簡體   English   中英

Pythonic方法延遲datetime-indexed列

[英]Pythonic way to lag datetime-indexed columns

我的數據幀具有各種類型的DateTime索引(可以是每周,每月,每年的數據)。 我想生成列為其他列的滯后值的列。 我從電子表格中導入了這些,我沒有在python中生成日期時間索引。

我正在努力尋找這種“pythonic”方式。 我想如果我使用Pandas的日期時間功能,那么在奇怪或異常數據的情況下,滯后可能會更強大。

我制作了一個似乎有效的玩具示例,但它在我的真實世界的例子中失敗了。

正確工作的玩具示例(創建一個具有上個月'foo'值的新列)

rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
toy2 = pd.DataFrame(pd.Series(np.random.randint(0,  50, len(rng)), index=rng, name="foo"))

            foo
2012-01-31    4
2012-02-29    2
2012-03-31   27
2012-04-30    7
2012-05-31   44
2012-06-30   22
2012-07-31   16
2012-08-31   18
2012-09-30   35
2012-10-31   35
2012-11-30   16
2012-12-31   32

toy2['lag_foo']= toy2['foo'].shift(1,'m')

    foo lag_foo
2012-01-31  4   NaN
2012-02-29  2   4.0
2012-03-31  27  2.0
2012-04-30  7   27.0
2012-05-31  44  7.0
2012-06-30  22  44.0
2012-07-31  16  22.0
2012-08-31  18  16.0
2012-09-30  35  18.0
2012-10-31  35  35.0
2012-11-30  16  35.0
2012-12-31  32  16.0

但是,當我在我的真實例子中運行它時,它失敗了:

ValueError:無法從重復軸重新索引

print type(toy)
print toy.columns
print toy['IPE m2'][0:5]

<class 'pandas.core.frame.DataFrame'>
Index([u'IPE m2'], dtype='object')
Date
2016-04-30    43.29
2016-03-31    40.44
2016-02-29    34.17
2016-01-31    32.47
2015-12-31    39.35
Name: IPE m2, dtype: float64

異常跟蹤:

ValueError                                Traceback (most recent call last)
<ipython-input-170-9cb57a2ed681> in <module>()
----> 1 toy['prev_1m']= toy['IPE m2'].shift(1,'m')

C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
   2355         else:
   2356             # set column
-> 2357             self._set_item(key, value)
   2358 
   2359     def _setitem_slice(self, key, value):

C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
   2421 
   2422         self._ensure_valid_index(value)
-> 2423         value = self._sanitize_column(key, value)
   2424         NDFrame._set_item(self, key, value)
   2425 

C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
   2555 
   2556         if isinstance(value, Series):
-> 2557             value = reindexer(value)
   2558 
   2559         elif isinstance(value, DataFrame):

C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in reindexer(value)
   2547                     # duplicate axis
   2548                     if not value.index.is_unique:
-> 2549                         raise e
   2550 
   2551                     # other

ValueError: cannot reindex from a duplicate axis

好像我錯過了一些我認為的熊貓日期時間指數的微妙之處。 另外,我甚至不確定這是否是理想的做法。 我唯一可以懷疑的是,非工作的toy.index具有None作為freq,而工作toy2示例的頻率設置為'M'

toy.index
DatetimeIndex(['2016-04-30', '2016-03-31', '2016-02-29', '2016-01-31',
               '2015-12-31', '2015-11-30', '2015-10-31', '2015-09-30',
               '2015-08-31', '2015-07-31',
               ...
                      'NaT',        'NaT',        'NaT',        'NaT',
                      'NaT',        'NaT',        'NaT',        'NaT',
                      'NaT',        'NaT'],
              dtype='datetime64[ns]', name=u'Date', length=142, freq=None)


toy2.index
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
               '2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31',
               '2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31'],
              dtype='datetime64[ns]', freq='M')
In [ ]:

===========================

我扔掉了NaT的

toy = toy.dropna()

toy['prev_1m']= toy['IPE m2'].shift(1,'m')

我得到了我想要的結果。 但是,我也收到警告:

C:\Users\mds\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

====

這種分配方式抑制了警告:

toy.loc[:,'prev_1m2']= toy['IPE m2'].shift(1,'m')

還有另一個問題 - toy DataFrame中的索引中存在很多NaT ,因此index具有重復值。 (也許某些日期時間也是重復的。)

樣品:

import pandas as pd
import numpy as np

rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
toy2 = pd.DataFrame(pd.Series(np.random.randint(0,  50, len(rng)), index=rng, name="foo"))

df = pd.DataFrame({'foo': [10,30,19]}, index=[np.nan, np.nan, np.nan])
print (df)
     foo
NaN   10
NaN   30
NaN   19

toy2 = pd.concat([toy2, df])
print (toy2)
            foo
2012-01-31   18
2012-02-29   34
2012-03-31   43
2012-04-30   17
2012-05-31   45
2012-06-30    8
2012-07-31   36
2012-08-31   26
2012-09-30    5
2012-10-31   18
2012-11-30   39
2012-12-31    3
NaT          10
NaT          30
NaT          19

toy2['lag_foo']= toy2['foo'].shift(1,'m')
print (toy2)

ValueError:無法從重復軸重新索引

一種可能的解決方案是省略參數freq=m

toy2['lag_foo']= toy2['foo'].shift(1)
print (toy2)
            foo  lag_foo
2012-01-31   21      NaN
2012-02-29   13     21.0
2012-03-31   41     13.0
2012-04-30   38     41.0
2012-05-31   15     38.0
2012-06-30   41     15.0
2012-07-31   30     41.0
2012-08-31   18     30.0
2012-09-30   12     18.0
2012-10-31   35     12.0
2012-11-30   23     35.0
2012-12-31    7     23.0
NaT          10      7.0
NaT          30     10.0
NaT          19     30.0

如果需要刪除index包含NaNNaT )的所有記錄,請使用帶有boolean indexing notnull

print (toy2)
            foo
2012-01-31   41
2012-02-29   15
2012-03-31    8
2012-04-30    2
2012-05-31   16
2012-06-30   43
2012-07-31    2
2012-08-31   15
2012-09-30    3
2012-10-31   46
2012-11-30   34
2012-12-31   36
NaT          10
NaT          30
NaT          19

toy2 = toy2[pd.notnull(toy2.index)]

toy2['lag_foo']= toy2['foo'].shift(1, 'm')
print (toy2)
            foo  lag_foo
2012-01-31   41      NaN
2012-02-29   15     41.0
2012-03-31    8     15.0
2012-04-30    2      8.0
2012-05-31   16      2.0
2012-06-30   43     16.0
2012-07-31    2     43.0
2012-08-31   15      2.0
2012-09-30    3     15.0
2012-10-31   46      3.0
2012-11-30   34     46.0
2012-12-31   36     34.0

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM