Using Pandas pd.pivot_table to pivot by date

I'm still very new to pandas and Python, and I'm afraid I'm doing something foolish here. That said, the closest thing I could find to the problem I'm encountering is How to create pivot with totals (margins) in Pandas?, so I am asking.

I've got a simple dataframe with 3 columns.

  Account ID Amount Close Date
0         10a    100 2009-01-01
1         10a     50 2009-01-01
2         10a    100 2010-04-01
3         10a    100 2011-04-01
4         10a    100 2012-05-01
..        ...    ...        ...
35         4b     .5 2009-01-01
36         4c     .5 2009-01-01
37         5a     .5 2009-01-01
38         5b     .5 2009-01-01
39         8a     .5 2009-01-01

I think I'm having trouble with the Close Date column. I suspect that somehow pandas doesn't realize that one 2009-01-01 equals another 2009-01-01.
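One thing worth ruling out is the dtype of the Close Date column: if it was read in as plain strings, grouping and pivoting can behave unexpectedly. A minimal check-and-convert sketch, assuming the dataframe is named opps as in the attempt further down:

import pandas as pd

# Show each column's dtype; 'object' usually means strings rather than datetimes.
print(opps.dtypes)

# Parse the column into real Timestamps so equal dates are treated as equal keys.
opps['Close Date'] = pd.to_datetime(opps['Close Date'])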

I'd like to pivot this table to get output like the following, where things are grouped first by Account ID and then by Close Date. If an Account ID has multiple rows with the same close date, I'd like those amounts added up in the values column, like this. (For the record, I'm really only interested in the year, but in troubleshooting I've been trying to simplify as much as possible.)

Account ID  Close Date   Amount
2c          2009-01-01      100
            2011-01-01      100
10a         2009-01-01      150
            2010-04-01      100
...

I've tried a variety of things and keep running into problems that make me think I've got some kind of date problem. Maybe I need to import a different library?

Here's my latest attempt:

pd.pivot_table(opps, index=['Account ID'], columns='Close Date', values=['Amount'], aggfunc=np.sum)

and the output is very close to what I want.

The only problem is that for any Account ID that has two rows for a date, that data just disappears in the output. Account 10a has 3 rows for 2009-01-01, but the pivot table shows NaN for 2009-01-01.
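For what it's worth, here is a small self-contained sketch of the same wide pivot (opps_demo is made-up data, not the real opps frame) in which duplicate account/date rows are summed into one cell rather than dropped:

import numpy as np
import pandas as pd

# Tiny made-up example: account 10a has two rows with the same close date.
opps_demo = pd.DataFrame({'Account ID': ['10a', '10a', '4b'],
                          'Amount': [100, 50, 0.5],
                          'Close Date': pd.to_datetime(['2009-01-01', '2009-01-01', '2009-01-01'])})

# The two 10a rows for 2009-01-01 come out as a single 150.0 cell, not NaN.
wide = pd.pivot_table(opps_demo, index=['Account ID'], columns='Close Date',
                      values=['Amount'], aggfunc=np.sum)
print(wide)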

I thought I'd try the same pivot table with margins=True.

When I did that, I got an error message.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-182-f8dc0d75c868> in <module>()
      3                margins = "True",
      4                values=['Amount'],
----> 5                aggfunc=np.sum)

/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna)
    141     if margins:
    142         table = _add_margins(table, data, values, rows=index,
--> 143                              cols=columns, aggfunc=aggfunc)
    144 
    145     # discard the top level

/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in _add_margins(table, data, values, rows, cols, aggfunc)
    167 
    168     if values:
--> 169         marginal_result_set = _generate_marginal_results(table, data, values, rows, cols, aggfunc, grand_margin)
    170         if not isinstance(marginal_result_set, tuple):
    171             return marginal_result_set

/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in _generate_marginal_results(table, data, values, rows, cols, aggfunc, grand_margin)
    236                 # we are going to mutate this, so need to copy!
    237                 piece = piece.copy()
--> 238                 piece[all_key] = margin[key]
    239 
    240                 table_pieces.append(piece)

/Applications/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

/Applications/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1802         # get column
   1803         if self.columns.is_unique:
-> 1804             return self._get_item_cache(key)
   1805 
   1806         # duplicate columns & possible reduce dimensionaility

/Applications/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1082         res = cache.get(item)
   1083         if res is None:
-> 1084             values = self._data.get(item)
   1085             res = self._box_item_values(item, values)
   1086             cache[item] = res

/Applications/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   2849 
   2850             if not isnull(item):
-> 2851                 loc = self.items.get_loc(item)
   2852             else:
   2853                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Applications/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key, method)
   1570         """
   1571         if method is None:
-> 1572             return self._engine.get_loc(_values_from_object(key))
   1573 
   1574         indexer = self.get_indexer([key], method=method)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()

KeyError: Timestamp('2009-01-01 00:00:00')

Thanks for any advice you can offer.

It sounds like a group-by rather than a pivot table to me - your columns are fixed.

For example:

import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data=[['10a', 100, date(2009, 1, 1)],
                        ['10a', 50, date(2009, 1, 1)],
                        ['10a', 100, date(2010, 4, 1)],
                        ['10a', 100, date(2011, 4, 1)],
                        ['10a', 100, date(2012, 5, 1)],
                        ['4b', .5, date(2009, 1, 1)],
                        ['4c', .5, date(2009, 1, 1)],
                        ['5a', .5, date(2009, 1, 1)],
                        ['5b', .5, date(2009, 1, 1)],
                        ['8a', .5, date(2009, 1, 1)]],
                  columns=['Account ID', 'Amount', 'Close Date'])

df.groupby(['Account ID', 'Close Date']).sum()

gives:

                       Amount
Account ID Close Date        
10a        2009-01-01   150.0
           2010-04-01   100.0
           2011-04-01   100.0
           2012-05-01   100.0
4b         2009-01-01     0.5
4c         2009-01-01     0.5
5a         2009-01-01     0.5
5b         2009-01-01     0.5
8a         2009-01-01     0.5

Apologies if I've missed something.

The equivalent with pivot_table is:

df.pivot_table(index=['Account ID', 'Close Date'], values=['Amount'], aggfunc=np.sum)
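And since the question mentions that only the year really matters, here is a hedged sketch of grouping by the year component instead of the full date (assumes Close Date has been converted to datetimes; the 'Close Year' label is just an illustrative name):

import pandas as pd

# Make sure the dates are real Timestamps, then group by account and close year.
df['Close Date'] = pd.to_datetime(df['Close Date'])
by_year = df.groupby(['Account ID', df['Close Date'].dt.year.rename('Close Year')])['Amount'].sum()
print(by_year)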
