Pandas: How to groupby a dataframe and convert the rows to columns and consolidate the rows
Here's my data structure:
date_time ticker stock_price type bid ask impVol symbol strike_price delta vega gamma theta rho diff
371 2021-02-19 14:28:45 AMZN 3328.23 put 44.5 46.85 NaN AMZN210226P03330000 3330.0 NaN NaN NaN NaN NaN 1.77
370 2021-02-19 14:28:45 AMZN 3328.23 call 43.5 45.80 NaN AMZN210226C03330000 3330.0 NaN NaN NaN NaN NaN 1.77
1066 2021-02-19 14:28:55 AMZN 3328.23 call 43.5 45.80 NaN AMZN210226C03330000 3330.0 NaN NaN NaN NaN NaN 1.77
1067 2021-02-19 14:28:55 AMZN 3328.23 put 44.5 46.85 NaN AMZN210226P03330000 3330.0 NaN NaN NaN NaN NaN 1.77
My goal is to group by date_time, then create separate columns for the put's bid and ask and the call's bid and ask.
My expected output would be something like this:
date_time ticker stock_price put_bid put_ask call_bid call_ask impVol symbol strike_price delta vega gamma theta rho diff
371 2021-02-19 14:28:45 AMZN 3328.23 44.5 46.85 43.5 45.80 NaN AMZN210226P03330000 3330.0 NaN NaN NaN NaN NaN 1.77
1066 2021-02-19 14:28:55 AMZN 3328.23 43.5 45.80 44.5 46.85 NaN AMZN210226C03330000 3330.0 NaN NaN NaN NaN NaN 1.77
I tried every example I could find, including pivoting like this:
df=pd.pivot_table(df,index=['date_time','type'],columns=df.groupby(['date_time','type']).cumcount().add(1),values=['market_price'],aggfunc='sum')
df.columns=df.columns.map('{0[0]}{0[1]}'.format)
I think I'm on the right path, but I just can't figure it out. Any help would be greatly appreciated.
Why are you trying to use a groupby? pandas.pivot_table() does the grouping for you.
You haven't provided a reproducible example (hint: please do next time), so I made up some random data to explain a possible solution. Note this is not identical to what you need, but it's a starting point:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['period'] = np.repeat([1,2],2)
df['product'] = 'kiwi'
df['type'] = np.tile(['buy','sell'],2)
df['price'] = np.arange(1,5)
out = pd.pivot_table(df, index =['period','product'], columns = ['type'] , values ='price' )
You need to specify what you want on the left (index), what you want on the top (columns) and which values (values) you want to show for this combination.
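To make the three arguments concrete, here is a small sketch that rebuilds the same made-up data, pivots it, and then moves the index levels back into ordinary columns. The name `flat` is only for illustration.

```python
import numpy as np
import pandas as pd

# Same made-up data as in the example above
df = pd.DataFrame({
    "period": np.repeat([1, 2], 2),
    "product": "kiwi",
    "type": np.tile(["buy", "sell"], 2),
    "price": np.arange(1, 5),
})

# index -> row labels, columns -> new column headers, values -> cell contents
out = pd.pivot_table(df, index=["period", "product"], columns="type", values="price")

# Move the index levels back into regular columns and drop the
# leftover "type" label on the column axis
flat = out.reset_index()
flat.columns.name = None
print(flat)
```

Each (period, product) pair becomes one row, with one column per value of type.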
Also, are you sure the date time will be the same? What if the first two rows are off by even one second; is that possible? And what if the stock price is different between the first and the 2nd row of your table? I don't know your data so no idea if that is possible, but it's something to think about.
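One way to check these questions before pivoting is to count, per timestamp, how many rows, option types and distinct stock prices there are. This is only a sketch: the frame below is retyped from the four rows in the question with just the relevant columns, and the names `per_ts` and `problems` are made up for illustration.

```python
import pandas as pd

# The four rows from the question, reduced to the columns that matter here
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2021-02-19 14:28:45", "2021-02-19 14:28:45",
        "2021-02-19 14:28:55", "2021-02-19 14:28:55",
    ]),
    "ticker": ["AMZN"] * 4,
    "stock_price": [3328.23] * 4,
    "type": ["put", "call", "call", "put"],
    "bid": [44.5, 43.5, 43.5, 44.5],
    "ask": [46.85, 45.80, 45.80, 46.85],
})

# Per timestamp: number of rows, distinct option types, distinct stock prices
per_ts = df.groupby("date_time").agg(
    n_rows=("type", "count"),
    n_types=("type", "nunique"),
    n_prices=("stock_price", "nunique"),
)

# Timestamps that do not hold exactly one put and one call at a single price
problems = per_ts[
    (per_ts["n_rows"] != 2) | (per_ts["n_types"] != 2) | (per_ts["n_prices"] != 1)
]
print(problems)
```

If `problems` comes back empty, every timestamp pairs up cleanly. If the timestamps turn out to differ by a second or so, rounding them first (for instance with `Series.dt.floor`) is one option, but whether that is appropriate depends on your data.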
Also note that my example does not specify an aggregate function, so it defaults to the mean. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
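The default only matters when several rows land in the same cell. A minimal sketch with made-up data (two "buy" rows for the same period, which is an assumption for illustration) shows the difference between the default, an explicit aggfunc, and DataFrame.pivot, which does not aggregate at all:

```python
import pandas as pd

# Made-up data: two "buy" rows share the same period
df = pd.DataFrame({
    "period": [1, 1, 1],
    "type": ["buy", "buy", "sell"],
    "price": [1.0, 3.0, 5.0],
})

# No aggfunc given: the duplicate "buy" rows are averaged
mean_out = pd.pivot_table(df, index="period", columns="type", values="price")

# Explicit aggfunc: keep the last value seen instead
last_out = pd.pivot_table(
    df, index="period", columns="type", values="price", aggfunc="last"
)

# DataFrame.pivot does not aggregate, so the same duplicates raise an error
try:
    df.pivot(index="period", columns="type", values="price")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(mean_out)
print(last_out)
print(pivot_failed)
```

Here the "buy" cell is 2.0 with the default and 3.0 with aggfunc="last", so it is worth choosing the aggregation deliberately if duplicates are possible.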
To use a pivot table to reorient your data the way you're describing, you'll need to include all columns which vary with type, which in this case includes "symbol" (note the P vs. C in the code):
In [10]: pivoted = df.pivot(
...: index=['date_time', 'ticker', 'stock_price', 'impVol', 'strike_price','delta','vega', 'gamma','theta','rho','diff'],
...: columns=['type', 'symbol'],
...: values=['bid', 'ask'],
...: )
In [11]: pivoted
Out[11]:
bid ask
type put call put call
symbol AMZN210226P03330000 AMZN210226C03330000 AMZN210226P03330000 AMZN210226C03330000
date_time ticker stock_price impVol strike_price delta vega gamma theta rho diff
2021-02-19 14:28:45 AMZN 3328.23 NaN 3330.0 NaN NaN NaN NaN NaN 1.77 44.5 43.5 46.85 45.8
2021-02-19 14:28:55 AMZN 3328.23 NaN 3330.0 NaN NaN NaN NaN NaN 1.77 44.5 43.5 46.85 45.8
If you'd like, you could then relabel your columns:
In [12]: pivoted.columns = pd.Index([i[0] + '_' + i[1] for i in pivoted.columns.values])
In [13]: pivoted
Out[13]:
bid_put bid_call ask_put ask_call
date_time ticker stock_price impVol strike_price delta vega gamma theta rho diff
2021-02-19 14:28:45 AMZN 3328.23 NaN 3330.0 NaN NaN NaN NaN NaN 1.77 44.5 43.5 46.85 45.8
2021-02-19 14:28:55 AMZN 3328.23 NaN 3330.0 NaN NaN NaN NaN NaN 1.77 44.5 43.5 46.85 45.8
Alternatively, you could just exclude symbol from the index, but either way, you need to either stack symbol, drop it, or manually handle it some way because the data is not the same for each "type".
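As a rough sketch of the "drop it" option, the following leaves symbol out of the pivot entirely and then renames the columns to the put_bid / put_ask / call_bid / call_ask layout from the question. The frame is retyped from the question's four rows with the all-NaN columns omitted for brevity, the name `wide` is made up, and passing a list as `index` to `DataFrame.pivot` assumes pandas 1.1 or newer.

```python
import pandas as pd

# The four rows from the question (all-NaN columns left out for brevity)
df = pd.DataFrame(
    {
        "date_time": pd.to_datetime([
            "2021-02-19 14:28:45", "2021-02-19 14:28:45",
            "2021-02-19 14:28:55", "2021-02-19 14:28:55",
        ]),
        "ticker": ["AMZN"] * 4,
        "stock_price": [3328.23] * 4,
        "type": ["put", "call", "call", "put"],
        "bid": [44.5, 43.5, 43.5, 44.5],
        "ask": [46.85, 45.80, 45.80, 46.85],
        "symbol": [
            "AMZN210226P03330000", "AMZN210226C03330000",
            "AMZN210226C03330000", "AMZN210226P03330000",
        ],
        "strike_price": [3330.0] * 4,
        "diff": [1.77] * 4,
    },
    index=[371, 370, 1066, 1067],
)

# symbol is neither in index, columns nor values, so it is simply dropped
wide = df.pivot(
    index=["date_time", "ticker", "stock_price", "strike_price", "diff"],
    columns="type",
    values=["bid", "ask"],
)

# Columns are (value, type) pairs such as ("bid", "put"); join them as type_value
wide.columns = [f"{opt_type}_{value}" for value, opt_type in wide.columns]

# Put the new columns in the order from the question and flatten the index
wide = wide[["put_bid", "put_ask", "call_bid", "call_ask"]].reset_index()
print(wide)
```

This gives one row per date_time with the four bid/ask columns side by side. If you need the symbols as well, they would have to be pivoted or merged back in separately.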