[英]Reset the index for a pandas DataFrame created from a groupby or pivot?
I have data that contains prices, volumes and other data about various financial securities. 我的数据包含有关各种金融证券的价格,数量和其他数据。 My input data looks like the following: 我的输入数据如下所示:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
2008, 2008, 2008, 2008, 2009, 2009,
2009, 2009, 2009], name='year')
df = pandas.DataFrame.from_items([('price', prices), ('volume', volumes)])
df.index = idx
# BELOW IS AN EXMPLE OF WHAT INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like: 我想产生一个看起来像这样的数据框:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby: 我知道一种使用groupby产生上述输出的方法:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
series = data['price'].copy()
series.index = range(len(series))
series.name = group
df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot: 而且我也知道,您可以进行数据透视以获取一个数据框,该数据框具有数据透视缺失的索引所对应的NaN:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following: 我的问题如下:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot? 有没有一种方法可以在不创建序列的情况下在groupby中创建我的输出DataFrame,或者有一种我可以重新索引我的输入DataFrame以便使用数据透视表获得所需输出的方法?
You need to label each year 0-4. 您需要每年标记0-4。 To do this, use the cumcount
after grouping. 为此,请在分组后使用cumcount
。 Then you can pivot correctly using that new column as the index. 然后,您可以使用该新列作为索引正确地进行数据透视。
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214
You can use groupby
with apply
new Series
created with numpy array
by values
and then reshape by unstack
: 您可以将groupby
与通过values
apply
numpy array
创建的新Series
结合使用,然后通过unstack
重塑unstack
:
print (df.groupby(level='year')['price'].apply(lambda x: pd.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.