Transform a pandas dataframe: need for a more efficient solution
I have a dataframe indexed by dates from a certain period. The columns are predictions of the value of a variable at the end of a given year. My original dataframe looks something like this:
            2016  2017  2018
2016-01-01   0.0     1   NaN
2016-07-01   1.0     1   4.1
2017-01-01   NaN     5   3.0
2017-07-01   NaN     2   2.0
where NaN means that no prediction exists for that year on that date. Since I am working with 20+ years and most predictions cover only the next 2-3 years, my real dataframe has 20+ columns that mostly contain NaN values. For instance, the column for the year 2005 has predictions made in 2003-2005, but it is all NaN in the range 2006-2020.
I would like to transform my dataframe into something like this:
            Y_0  Y_1  Y_2
2016-01-01    0    1  NaN
2016-07-01    1    1  4.1
2017-01-01    5    3  NaN
2017-07-01    2    2  NaN
where Y_j represents the prediction made for the year index.year + j. This way, I would have a dataframe with only 4 columns (Y_0, Y_1, Y_2, Y_3).
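To make the relabelling concrete, here is a small sketch for a single row of the example (the one dated 2017-01-01). The names `made_on`, `predictions` and `relabelled` are only illustrative and do not come from the original code.

```python
import pandas as pd

# One row of the example: the date the predictions were made on,
# and the non-NaN predictions keyed by target year.
made_on = pd.Timestamp("2017-01-01")
predictions = {2017: 5.0, 2018: 3.0}

# Y_j holds the prediction for year = made_on.year + j,
# so j is simply target_year - made_on.year.
relabelled = {
    f"Y_{target_year - made_on.year}": value
    for target_year, value in predictions.items()
}
print(relabelled)
```

The 2017 prediction lands in Y_0 and the 2018 prediction in Y_1, matching the third row of the desired table.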
I actually achieved this, but in what I think is a very inefficient way:
import numpy

for i in range(4):
    df[f'Y_{i}'] = numpy.nan  # create columns [Y_0, Y_1, Y_2, Y_3]
for index, row in df.iterrows():  # iterate through each row of df
    for year in row.dropna().index:  # iterate through each year where a prediction exists
        # difference between the year the prediction is for and the year it was made (0, 1, 2 or 3)
        year_diff = int(year) - index.year
        # set the values of 'Y_0', 'Y_1', 'Y_2' and 'Y_3' cell by cell
        df.loc[index, f'Y_{year_diff}'] = df.loc[index, year]
df = df.iloc[:, -4:]  # delete all but the new columns
For a dataframe with only 1000 rows, this takes almost 3 seconds to run. Can anyone think of a better solution?
You could use melt to convert it to the long format, then pivot back based on the year differences.
Using your DataFrame as an example:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [datetime.date(2016, 1, 1), datetime.date(2016, 7, 1),
                            datetime.date(2017, 1, 1), datetime.date(2017, 7, 1)],
                   2016: [0, 1, np.nan, np.nan],
                   2017: [1, 1, 5, 2],
                   2018: [np.nan, 4.1, 3, 2]})
# one row per (date, prediction_year) pair
df = df.melt(id_vars='date', value_vars=[2016, 2017, 2018],
             var_name='prediction_year', value_name='prediction')
Long format:
          date prediction_year  prediction
0   2016-01-01            2016         0.0
1   2016-07-01            2016         1.0
2   2017-01-01            2016         NaN
3   2017-07-01            2016         NaN
4   2016-01-01            2017         1.0
5   2016-07-01            2017         1.0
6   2017-01-01            2017         5.0
7   2017-07-01            2017         2.0
8   2016-01-01            2018         NaN
9   2016-07-01            2018         4.1
10  2017-01-01            2018         3.0
11  2017-07-01            2018         2.0
Convert back to the desired wide format:
df['year'] = pd.to_datetime(df['date']).dt.year
df['dt'] = df['prediction_year'] - df['year']
df = df.pivot(index = 'date', columns='dt', values='prediction').dropna(axis = 1, how = 'all').add_prefix('Y_')
Y_0 Y_1 Y_2
date
2016-01-01 0.0 1.0 NaN
2016-07-01 1.0 1.0 4.1
2017-01-01 5.0 3.0 NaN
2017-07-01 2.0 2.0 NaN
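The question's frame keeps the dates in the index rather than in a 'date' column, so the melt/pivot steps above could be wrapped roughly as follows. This is only a sketch: the function name `to_year_offsets` and the intermediate column names are made up for illustration, and it assumes the column labels are years (integers or strings that `int` can parse) and that the index holds unique dates.

```python
import numpy as np
import pandas as pd


def to_year_offsets(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape a date-indexed frame of per-year predictions into Y_<offset> columns."""
    # Move the date index into a regular column, then go to long format.
    long = wide.rename_axis("date").reset_index().melt(
        id_vars="date", var_name="prediction_year", value_name="prediction"
    )
    # Offset between the year predicted and the year the prediction was made.
    made_in = pd.to_datetime(long["date"]).dt.year
    long["offset"] = long["prediction_year"].astype(int) - made_in
    # Back to wide format, one column per offset; drop offsets that never occur.
    out = long.pivot(index="date", columns="offset", values="prediction")
    out = out.dropna(axis=1, how="all").add_prefix("Y_")
    out.columns.name = None
    return out


example = pd.DataFrame(
    {
        2016: [0, 1, np.nan, np.nan],
        2017: [1, 1, 5, 2],
        2018: [np.nan, 4.1, 3, 2],
    },
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)
result = to_year_offsets(example)
print(result)
```

Dropping the all-NaN columns after the pivot (rather than dropping NaN rows before it) keeps every date in the result, even one that has no predictions at all.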
Let's try stack, then calculate the year difference:
# if the index is not already datetime
df.index = pd.to_datetime(df.index)
df = (df.stack().reset_index()
        .assign(date_diff=lambda x: x['level_1'].astype(int) - x['level_0'].dt.year)
        .pivot(index='level_0', columns='date_diff', values=0)
        .add_prefix('Y_')
     )
Output:
date_diff Y_0 Y_1 Y_2
level_0
2016-01-01 0.0 1.0 NaN
2016-07-01 1.0 1.0 4.1
2017-01-01 5.0 3.0 NaN
2017-07-01 2.0 2.0 NaN
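To check on your own machine whether the reshape-based approach is actually faster than the row-by-row loop, a rough harness along these lines may help. Everything here is an assumption made for illustration: the synthetic data (weekly dates starting in 2003, four predictions per row), the helper names `make_frame`, `reshape_loop` and `reshape_stack`, and the reduced row count of 300 to keep the run short. No timings are claimed; the script only prints whatever it measures.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=300, horizon=4, seed=0):
    """Synthetic frame: each date has predictions for its own year and the next horizon-1 years."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2003-01-01", periods=n_rows, freq="7D")
    years = np.arange(dates.year.min(), dates.year.max() + horizon)
    values = np.full((n_rows, len(years)), np.nan)
    for j in range(horizon):
        cols = dates.year.to_numpy() - years[0] + j
        values[np.arange(n_rows), cols] = rng.random(n_rows)
    return pd.DataFrame(values, index=dates, columns=years)


def reshape_loop(df, horizon=4):
    """The cell-by-cell approach from the question, on a copy of the input."""
    df = df.copy()
    for i in range(horizon):
        df[f"Y_{i}"] = np.nan
    for index, row in df.iterrows():
        for year in row.dropna().index:
            year_diff = int(year) - index.year
            df.loc[index, f"Y_{year_diff}"] = df.loc[index, year]
    return df.iloc[:, -horizon:]


def reshape_stack(df):
    """The stack/pivot approach, with explicit column names."""
    long = df.stack().reset_index()
    long.columns = ["date", "year", "value"]
    # Older pandas drops NaN in stack(), newer versions may keep it; drop explicitly.
    long = long.dropna(subset=["value"])
    long["offset"] = long["year"].astype(int) - long["date"].dt.year
    return long.pivot(index="date", columns="offset", values="value").add_prefix("Y_")


frame = make_frame()

start = time.perf_counter()
expected = reshape_loop(frame)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
got = reshape_stack(frame)
stack_seconds = time.perf_counter() - start

same = np.allclose(got.to_numpy(), expected.to_numpy(), equal_nan=True)
print(f"loop: {loop_seconds:.3f}s, stack: {stack_seconds:.3f}s, same result: {same}")
```

Raising `n_rows` to 1000 reproduces the size mentioned in the question; the equality check is there so that any speed comparison is between two functions that return the same values.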