Transforming a pandas dataframe: a more efficient solution needed

Question

I have a dataframe indexed by dates from a certain period. My columns are predictions about the value of a variable by the end of a given year. My original dataframe looks something like this:

            2016  2017  2018
2016-01-01   0.0     1   NaN
2016-07-01   1.0     1   4.1
2017-01-01   NaN     5   3.0
2017-07-01   NaN     2   2.0

where NaN means that no prediction exists for that given year.

Since I am working with 20+ years and most predictions are for the next 2-3 years, my real dataframe has 20+ columns mostly containing NaN values. For instance, the column for the year 2005 has predictions made in 2003-2005, but in the range 2006-2020 it's all NaN.

I would like to transform my dataframe to something like this:

            Y_0  Y_1  Y_2
2016-01-01    0    1  NaN
2016-07-01    1    1  4.1
2017-01-01    5    3  NaN
2017-07-01    2    2  NaN

where Y_j represents the prediction made for the year = index.year + j. This way, I would have a dataframe with only 4 columns (Y_0, Y_1, Y_2, Y_3).

I actually achieved this, but in what I think is a very inefficient way:


import numpy

for i in range(4):
    df[f'Y_{i}'] = numpy.nan  # create columns [Y_0, Y_1, Y_2, Y_3]

for index, row in df.iterrows():  # iterate through each row of df

    for year in row.dropna().index:  # iterate through each year where a prediction exists

        # difference between the year the prediction is for and the year it was made (possible values: 0, 1, 2 or 3)
        year_diff = int(year) - index.year

        df.loc[index, f'Y_{year_diff}'] = df.loc[index, year]  # set the values for the columns 'Y_0', 'Y_1', 'Y_2' and 'Y_3' cell by cell

df = df.iloc[:, -4:]  # after both loops finish, delete all but the new columns

For a dataframe with only 1000 rows, this takes almost 3 seconds to run. Can anyone think of a better solution?

Answer 1

You could use melt to convert it to the long format, then pivot back based on the year differences.

Using your DataFrame as an example:

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[datetime.date(2016, 1, 1), datetime.date(2016, 7, 1),
                      datetime.date(2017, 1, 1), datetime.date(2017, 7, 1)],
             2016:[0,1,np.nan,np.nan],
             2017:[1,1,5,2],
             2018:[np.nan, 4.1, 3, 2]})
# one row per (date, prediction_year) pair
df = df.melt(id_vars = 'date', value_vars = [2016, 2017, 2018], var_name='prediction_year', value_name='prediction')

Long format:

          date  prediction_year  prediction
0   2016-01-01             2016         0.0
1   2016-07-01             2016         1.0
2   2017-01-01             2016         NaN
3   2017-07-01             2016         NaN
4   2016-01-01             2017         1.0
5   2016-07-01             2017         1.0
6   2017-01-01             2017         5.0
7   2017-07-01             2017         2.0
8   2016-01-01             2018         NaN
9   2016-07-01             2018         4.1
10  2017-01-01             2018         3.0
11  2017-07-01             2018         2.0

Convert back to the desired wide format:

df['year'] = pd.to_datetime(df['date']).dt.year
df['dt'] = df['prediction_year'] - df['year']
df = df.pivot(index = 'date', columns='dt', values='prediction').dropna(axis = 1, how = 'all').add_prefix('Y_')

            Y_0 Y_1 Y_2
date            
2016-01-01  0.0 1.0 NaN
2016-07-01  1.0 1.0 4.1
2017-01-01  5.0 3.0 NaN
2017-07-01  2.0 2.0 NaN
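
The example above keeps the dates in a regular 'date' column, while the question's dataframe is indexed by date. A minimal sketch of the same melt/pivot idea for that case follows. The index name "date" and the variable names long_df and wide are assumptions made for illustration, and missing predictions are dropped before pivoting instead of removing all-NaN columns afterwards.

```python
import numpy as np
import pandas as pd

# Sample frame from the question: a DatetimeIndex and one column per target year.
df = pd.DataFrame(
    {2016: [0, 1, np.nan, np.nan],
     2017: [1, 1, 5, 2],
     2018: [np.nan, 4.1, 3, 2]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
df.index.name = "date"  # assumed name, so that reset_index() yields a 'date' column

# Long format: one row per (date, prediction_year) pair, missing predictions removed.
long_df = df.reset_index().melt(id_vars="date", var_name="prediction_year",
                                value_name="prediction")
long_df = long_df.dropna(subset=["prediction"])

# Offset between the year the prediction targets and the year it was made.
long_df["dt"] = long_df["prediction_year"].astype(int) - long_df["date"].dt.year

# Back to wide format, one column per offset.
wide = long_df.pivot(index="date", columns="dt", values="prediction").add_prefix("Y_")
print(wide)
```

Casting prediction_year with astype(int) lets the same code handle year labels stored as strings.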

Answer 2

Let's try stack, then calculate the year difference:

# if the index is not already datetime
df.index = pd.to_datetime(df.index)

df = (df.stack().reset_index()
   .assign(date_diff=lambda x: x['level_1'].astype(int) - x['level_0'].dt.year)
   .pivot(index='level_0', columns='date_diff', values=0)
   .add_prefix('Y_')
)

Output:

date_diff   Y_0  Y_1  Y_2
level_0                  
2016-01-01  0.0  1.0  NaN
2016-07-01  1.0  1.0  4.1
2017-01-01  5.0  3.0  NaN
2017-07-01  2.0  2.0  NaN
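
The question asks for a fixed set of four columns (Y_0 to Y_3), whereas the pivot only produces the offsets that actually occur in the data. The sketch below runs the stack approach end to end on the sample data and then reindexes the columns to a fixed horizon. The axis names "date" and "target_year", the series name "prediction", and the variable names tidy, horizon and wide are assumptions made for illustration. dropna() is called explicitly so the result does not depend on whether stack() discards missing values by default in the installed pandas version.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, indexed by the date each prediction was made.
df = pd.DataFrame(
    {2016: [0, 1, np.nan, np.nan],
     2017: [1, 1, 5, 2],
     2018: [np.nan, 4.1, 3, 2]},
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)

# Name both axes so the stacked columns get readable labels instead of level_0 / level_1.
tidy = (df.rename_axis(index="date", columns="target_year")
          .stack()
          .dropna()
          .rename("prediction")
          .reset_index())

# Offset between the target year and the year the prediction was made.
horizon = tidy["target_year"].astype(int) - tidy["date"].dt.year

wide = (tidy.assign(horizon=horizon)
            .pivot(index="date", columns="horizon", values="prediction")
            .reindex(columns=range(4))  # always emit offsets 0..3, even if unused
            .add_prefix("Y_"))
print(wide)
```

With the sample data no prediction looks three years ahead, so Y_3 is present but entirely NaN.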

Answer 1 (accepted): score 1, 2021-01-14 15:41:44

Answer 2: score 0, 2021-01-14 15:44:40
