
Python Pandas: melt, pivot, transpose on multiple columns

I have a dataframe that looks like the one shown below. The index is years (1964 to 2016, non-unique, each year repeats 31 times), the first column is days (1 to 31), and columns 2 to 13 are months (1 to 12).

The question is: how do I convert this to a pandas Series (or single-column DataFrame) with pd.DatetimeIndex dates? I've tried using groupby, melt, pivot and transpose, but I can't figure out the correct syntax, and the documentation is not clear. Thanks a lot for your help!

[Image: dataframe]

We want to take advantage of the pd.to_datetime functionality that accepts a dataframe with appropriately named columns, in this case 'year', 'month', and 'day'.

So the solution below aims to create a dataframe with those three columns and pass it to pd.to_datetime.
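As a small illustration of that pd.to_datetime behavior, here is a minimal sketch. The frame `parts` and its values are made up for this example and are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy frame: the column names 'year', 'month', 'day' are what
# pd.to_datetime looks for when it is handed a DataFrame.
parts = pd.DataFrame({
    'year': [1964, 1964, 1964],
    'month': [2, 2, 2],
    'day': [28, 29, 31],
})

# errors='coerce' turns impossible dates (here February 31) into NaT
# instead of raising.
dates = pd.to_datetime(parts, errors='coerce')
print(dates)
```

1964 is a leap year, so February 29 is kept and only the third row becomes NaT.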

  • We have 'year' in the index already, so let's get everything into the index. Start by getting 'day' into the index with df.set_index('day', append=True).
  • Next, we get 'month' into the index. Right now it's in the columns, so first we name the column axis with .rename_axis('month', axis=1).
  • Then we move it into the index with .stack().
  • Now there are three index levels. reset_index pushes them onto the front of the dataframe as three columns, so I take the first three columns with .reset_index().iloc[:, :3] and pass them to pd.to_datetime.
  • Since some combinations do not exist, like '1964-02-31', we pass errors='coerce', which returns NaT for such dates.
  • Finally, we filter the result using loc, dropping null values from the index.
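The steps above can be followed one at a time on a tiny frame. This is only a sketch: the two-row frame with days 30 and 31 and months 1 and 2 is invented here so that some of the dates are invalid, and the intermediate names `step1`, `step2` and `parts` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: days 30 and 31 for January (1) and February (2).
df = pd.DataFrame(
    {'day': [30, 31], 1: [8, 5], 2: [3, 7]},
    index=pd.Index([1999, 1999], name='year'),
)

# Step 1: move 'day' into the index next to 'year'.
step1 = df.set_index('day', append=True)

# Step 2: name the column axis 'month' (axis passed by keyword).
step2 = step1.rename_axis('month', axis=1)

# Step 3: stack the month columns into a third index level.
s = step2.stack()
print(list(s.index.names))

# Step 4: after reset_index the first three columns are year, day, month.
parts = s.reset_index().iloc[:, :3]
s.index = pd.to_datetime(parts, errors='coerce')

# Step 5: February 30 and 31 became NaT; keep only the valid dates.
s = s.loc[s.index.dropna()]
print(s)
```

Only the two January dates survive the final filter; the two February rows are dropped because their dates could not be built.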

Sample data

import pandas as pd

df = pd.DataFrame({
    'day': [1, 2, 3], 1: [8, 5, 3]
}, pd.Index([1999, 1999, 1999], name='year'))

df

      day  1
year        
1999    1  8
1999    2  5
1999    3  3

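Solution

# Move 'day' and the month columns into the index, leaving one value per row.
s = df.set_index('day', append=True).rename_axis('month', axis=1).stack()
# Build dates from the (year, day, month) columns; impossible dates become NaT.
s.index = pd.to_datetime(s.reset_index().iloc[:, :3], errors='coerce')
# Keep only the rows whose date is valid.
s = s.loc[s.index.dropna()]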

s

1999-01-01    8
1999-01-02    5
1999-01-03    3
dtype: int64

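Full data

import numpy as np
import pandas as pd

# 31 rows (days) by 12 month columns for 1964, with 'day' moved to the front.
df = pd.DataFrame(
    np.arange(31 * 12).reshape(31, 12),
    pd.Index([1964 for _ in range(31)], name='year'),
    np.arange(12) + 1
).assign(day=np.arange(31) + 1).iloc[:, [-1] + np.arange(12).tolist()]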

df

      day    1    2    3    4    5    6    7    8    9   10   11   12
year                                                                 
1964    1    0    1    2    3    4    5    6    7    8    9   10   11
1964    2   12   13   14   15   16   17   18   19   20   21   22   23
1964    3   24   25   26   27   28   29   30   31   32   33   34   35
1964    4   36   37   38   39   40   41   42   43   44   45   46   47
1964    5   48   49   50   51   52   53   54   55   56   57   58   59
1964    6   60   61   62   63   64   65   66   67   68   69   70   71
1964    7   72   73   74   75   76   77   78   79   80   81   82   83
1964    8   84   85   86   87   88   89   90   91   92   93   94   95
1964    9   96   97   98   99  100  101  102  103  104  105  106  107
1964   10  108  109  110  111  112  113  114  115  116  117  118  119
1964   11  120  121  122  123  124  125  126  127  128  129  130  131
1964   12  132  133  134  135  136  137  138  139  140  141  142  143
1964   13  144  145  146  147  148  149  150  151  152  153  154  155
1964   14  156  157  158  159  160  161  162  163  164  165  166  167
1964   15  168  169  170  171  172  173  174  175  176  177  178  179
1964   16  180  181  182  183  184  185  186  187  188  189  190  191
1964   17  192  193  194  195  196  197  198  199  200  201  202  203
1964   18  204  205  206  207  208  209  210  211  212  213  214  215
1964   19  216  217  218  219  220  221  222  223  224  225  226  227
1964   20  228  229  230  231  232  233  234  235  236  237  238  239
1964   21  240  241  242  243  244  245  246  247  248  249  250  251
1964   22  252  253  254  255  256  257  258  259  260  261  262  263
1964   23  264  265  266  267  268  269  270  271  272  273  274  275
1964   24  276  277  278  279  280  281  282  283  284  285  286  287
1964   25  288  289  290  291  292  293  294  295  296  297  298  299
1964   26  300  301  302  303  304  305  306  307  308  309  310  311
1964   27  312  313  314  315  316  317  318  319  320  321  322  323
1964   28  324  325  326  327  328  329  330  331  332  333  334  335
1964   29  336  337  338  339  340  341  342  343  344  345  346  347
1964   30  348  349  350  351  352  353  354  355  356  357  358  359
1964   31  360  361  362  363  364  365  366  367  368  369  370  371

s = df.set_index('day', append=True).rename_axis('month', axis=1).stack()
s.index = pd.to_datetime(s.reset_index().iloc[:, :3], errors='coerce')
s = s.loc[s.index.dropna()]

s

1964-01-01      0
1964-02-01      1
1964-03-01      2
1964-04-01      3
1964-05-01      4
1964-06-01      5
1964-07-01      6
1964-08-01      7
1964-09-01      8
1964-10-01      9
1964-11-01     10
1964-12-01     11
1964-01-02     12
1964-02-02     13
1964-03-02     14
...
1964-05-30    352
1964-06-30    353
1964-07-30    354
1964-08-30    355
1964-09-30    356
1964-10-30    357
1964-11-30    358
1964-12-30    359
1964-01-31    360
1964-03-31    362
1964-05-31    364
1964-07-31    366
1964-08-31    367
1964-10-31    369
1964-12-31    371
Length: 366, dtype: int64
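
Note that the result above is ordered by day first (day 1 of every month, then day 2 of every month, and so on) rather than chronologically. If calendar order is wanted, sorting the index afterwards is one option. The sketch below rebuilds the same synthetic 1964 frame in a slightly different way; the use of df.insert and the name `s_sorted` are choices made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same synthetic 1964 data as in "Full data", built a little more directly.
df = pd.DataFrame(
    np.arange(31 * 12).reshape(31, 12),
    index=pd.Index([1964] * 31, name='year'),
    columns=np.arange(1, 13),
)
df.insert(0, 'day', np.arange(1, 32))

s = df.set_index('day', append=True).rename_axis('month', axis=1).stack()
s.index = pd.to_datetime(s.reset_index().iloc[:, :3], errors='coerce')
s = s.loc[s.index.dropna()]

# Reorder into calendar order.
s_sorted = s.sort_index()
print(s_sorted.head(3))
```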

Alternative

# One [year, month, day] row per cell, in the same row-major order as ravel().
lol = [[y, m, d] for y, d in zip(df.index, df.day) for m in df.columns[1:]]
columns = ['year', 'month', 'day']
d1 = pd.DataFrame(lol, columns=columns)
dates = pd.to_datetime(d1, errors='coerce')
m = dates.notnull().values  # mask of valid dates

pd.Series(df.drop(columns='day').values.ravel()[m], dates[m])
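
Since the question mentions melt, here is a sketch of how the same reshaping might be done with it. This is not from the original answer: the toy frame, the `value` column name and the intermediate `long` frame are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: days 30 and 31 for January (1) and February (2).
df = pd.DataFrame(
    {'day': [30, 31], 1: [8, 5], 2: [3, 7]},
    index=pd.Index([1999, 1999], name='year'),
)

# Turn the month columns into rows: one (year, day, month, value) row per cell.
long = df.reset_index().melt(
    id_vars=['year', 'day'], var_name='month', value_name='value'
)

# Assemble dates from the three named columns; invalid ones become NaT.
long['date'] = pd.to_datetime(long[['year', 'month', 'day']], errors='coerce')

# Drop the invalid dates and index the values by date, in calendar order.
s = long.dropna(subset=['date']).set_index('date')['value'].sort_index()
print(s)
```

melt walks the frame column by column rather than row by row, so the sort_index call at the end is what puts the result into calendar order.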
