
Reshaping two-column data using pandas pivot

I am trying to reshape a long text file with two columns (a repeating date_time sequence and a single column of numerical values) into a pandas DataFrame with a single date_time index and multiple columns of data. The actual file holds 100 sets of 82 years of daily rainfall data (from a stochastic generator) and is some 3 million lines long. I want 100 columns of rainfall data against a date_time index of 82 years of daily dates (365 days per year, 366 in leap years). To simplify the exercise I present an example below, in which the four-row sequence stands in for a leap year:

2014/01/01  1
2014/01/02  2
2014/01/03  3

2014/01/01  4
2014/01/02  5
2014/01/03  6
2014/01/04  7

2014/01/01  8
2014/01/02  9
2014/01/03  10

The desired output is something like:

              0    1    2
2014/01/01    1    4    8
2014/01/02    2    5    9
2014/01/03    3    6    10
2014/01/04    nan  7    nan

This seems excruciatingly simple but it has me beat. I have tried to turn the original series into a dataframe, then use the following but Pandas does not seem to like a single column:

df.pivot()

First create a new column that indicates which output column each value belongs in.

Supposing you know the starting date of each sequence (and it is the same each time), you can do that like this, for example:

In [7]: df['set'] = (df['date'] ==  '2014/01/01').cumsum()

In [8]: df
Out[8]: 
         date  value  set
0  2014/01/01      1    1
1  2014/01/02      2    1
2  2014/01/03      3    1
3  2014/01/01      4    2
4  2014/01/02      5    2
5  2014/01/03      6    2
6  2014/01/04      7    2
7  2014/01/01      8    3
8  2014/01/02      9    3
9  2014/01/03     10    3 

When you have this column, you can use pivot :

In [9]: df.pivot(index='date', columns='set', values='value')
Out[9]: 
set          1  2   3
date                 
2014/01/01   1  4   8
2014/01/02   2  5   9
2014/01/03   3  6  10
2014/01/04 NaN  7 NaN
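
Putting the two steps together, a minimal self-contained sketch might look like the following. It assumes the file is whitespace-separated with no header row; the inline sample text and the column names date, value and set are only illustrative, and for a real file you would pass its path to read_csv instead of the StringIO object.

```python
import io
import pandas as pd

# Sample data in the same layout as the question (blank lines separate the sets).
raw = """\
2014/01/01  1
2014/01/02  2
2014/01/03  3

2014/01/01  4
2014/01/02  5
2014/01/03  6
2014/01/04  7

2014/01/01  8
2014/01/02  9
2014/01/03  10
"""

# Assumption: whitespace-separated columns, no header row.
# read_csv skips blank lines by default, so the separators are dropped.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["date", "value"])

# A new set starts every time the known first date reappears.
df["set"] = (df["date"] == "2014/01/01").cumsum()

# One row per date, one column per set; missing dates become NaN.
wide = df.pivot(index="date", columns="set", values="value")
print(wide)
```

Because the shorter sets have no row for 2014/01/04, those cells come out as NaN, which also turns the value columns into floats.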

EDIT: Thanks to DSM, another way to find the groups (and one where you don't have to know the first item of each group):

In [10]: df['date'] = pd.to_datetime(df['date'])

In [11]: df['set'] = (df['date'].diff().fillna(pd.Timedelta(0)) <= pd.Timedelta(0)).cumsum()

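This works because, when a new set starts, the date jumps backwards, so the time difference from the previous row is negative (assuming the rows are sorted by time within each set). The first row has no previous row, so its difference is filled with zero and it is counted as the start of the first set.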
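
As a rough, runnable sketch of this diff-based idea, the variant below compares strictly against a zero Timedelta instead of filling the first row. That is a small deviation from the line above: the first row's difference is NaT, which compares as False, so the sets are numbered from 0 as in the desired output in the question. The sample frame and the explicit date format are assumptions for illustration.

```python
import pandas as pd

# Same sample data as the question, already loaded into two columns.
df = pd.DataFrame({
    "date": ["2014/01/01", "2014/01/02", "2014/01/03",
             "2014/01/01", "2014/01/02", "2014/01/03", "2014/01/04",
             "2014/01/01", "2014/01/02", "2014/01/03"],
    "value": range(1, 11),
})

# Parse the strings so that differences between rows are Timedeltas.
df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")

# A negative step means the dates jumped back, i.e. a new set begins.
# The first row's diff is NaT, which is not < 0, so numbering starts at 0.
df["set"] = (df["date"].diff() < pd.Timedelta(0)).cumsum()

wide = df.pivot(index="date", columns="set", values="value")
print(wide)
```

This version does not need to know the first date of each set, but it does rely on the dates increasing within a set.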


 