Can I apply a function that uses 'shift' on a grouped data frame, and return a simple data frame from pandas?
I hope the subject line is relatively clear. I'm using Python/pandas, and I'm working with daily pricing data on equities. I have one large CSV file with data on 4000+ symbols, covering approximately 100 days. So there are many repeated date and symbol values, but symbol/date combinations are unique. I'm trying to get the percentage change for each symbol/date combination, for multiple lag (shift) periods. On a dataset of one symbol, this would be as simple as
(dataframe.Close - dataframe.Close.shift(1)) / dataframe.Close.shift(1)
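For a single symbol, that shift-based calculation looks like this (a minimal sketch with made-up prices):

```python
import pandas as pd

# Hypothetical closing prices for one symbol, in date order.
close = pd.Series([10.00, 11.00, 10.50])

# One-day percent change: today's close versus yesterday's.
perf1 = (close - close.shift(1)) / close.shift(1)

# The first value is NaN because there is no prior day to compare against.
print(perf1.tolist())
```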
Here is a sample of the initial data:
TradeDate Symbol Close
1/1/2014 A 10.00
1/2/2014 A 11.00
1/3/2014 A 10.50
1/1/2014 B 2.00
1/2/2014 B 2.10
1/3/2014 B 2.05
The output I'm trying to get is:
TradeDate Symbol Perf1 Perf2
1/1/2014 A NA NA
1/2/2014 A 0.10 NA
1/3/2014 A -0.045 0.05
1/1/2014 B NA NA
1/2/2014 B 0.05 NA
1/3/2014 B -0.024 0.025
I'm new to pandas, and I've been scouring the web for a similar example or a more general treatment of applying vectorized functions to groups in pandas. I'm not having much luck; I experimented with more traditional methods, looping over a list of unique tickers, calculating the performance percentages individually, assembling them into a data frame, then appending that to a 'master' data frame. It works, but takes 20+ minutes (and happens to be extremely clunky). I'm sure there's a better way, but I don't yet know enough to ask for specific functionality details.
Can anyone help? Thanks...
I think you can use groupby and pct_change (don't blame me for the name..).
First, let's make sure everything is a real datetime and sort it:
>>> df["TradeDate"] = pd.to_datetime(df["TradeDate"])
>>> df = df.sort_values(["Symbol", "TradeDate"])
>>> df
TradeDate Symbol Close
0 2014-01-01 A 10.00
1 2014-01-02 A 11.00
2 2014-01-03 A 10.50
3 2014-01-01 B 2.00
4 2014-01-02 B 2.10
5 2014-01-03 B 2.05
And then do the work:
>>> df.groupby("Symbol")["Close"].pct_change()
0 NaN
1 0.100000
2 -0.045455
3 NaN
4 0.050000
5 -0.023810
dtype: float64
>>> df["Perf1"] = df.groupby("Symbol")["Close"].pct_change()
>>> df["Perf2"] = df.groupby("Symbol")["Close"].pct_change(2)
>>> df
TradeDate Symbol Close Perf1 Perf2
0 2014-01-01 A 10.00 NaN NaN
1 2014-01-02 A 11.00 0.100000 NaN
2 2014-01-03 A 10.50 -0.045455 0.050
3 2014-01-01 B 2.00 NaN NaN
4 2014-01-02 B 2.10 0.050000 NaN
5 2014-01-03 B 2.05 -0.023810 0.025
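As a cross-check, pct_change is equivalent to the shift-based formula from the question applied per group. A self-contained sketch with the sample data rebuilt inline:

```python
import pandas as pd

# Rebuild the sample data (already in symbol/date order).
df = pd.DataFrame({
    "Symbol": list("AAABBB"),
    "Close": [10.00, 11.00, 10.50, 2.00, 2.10, 2.05],
})

# Shift within each group, then apply the question's formula.
# GroupBy.shift keeps the original index, so alignment is automatic.
prev = df.groupby("Symbol")["Close"].shift(1)
df["Perf1"] = (df["Close"] - prev) / prev

print(df)
```

Note that the grouped shift never bleeds across symbols: the first row of B gets NaN, not a comparison against the last row of A.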
It would probably be cleaner to do the grouping once, e.g.
grouped = df.groupby("Symbol")["Close"]
for i in range(1, 5):
    df["Perf{}".format(i)] = grouped.pct_change(i)
or something.
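Putting the whole answer together on the question's sample data (a self-contained sketch; note that on pandas 0.17+ the old DataFrame.sort is spelled sort_values):

```python
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    "TradeDate": ["1/1/2014", "1/2/2014", "1/3/2014"] * 2,
    "Symbol": ["A"] * 3 + ["B"] * 3,
    "Close": [10.00, 11.00, 10.50, 2.00, 2.10, 2.05],
})

# Parse dates and sort so each symbol's rows are in chronological order.
df["TradeDate"] = pd.to_datetime(df["TradeDate"])
df = df.sort_values(["Symbol", "TradeDate"]).reset_index(drop=True)

# Group once, then compute each lagged percent change from the same groupby.
grouped = df.groupby("Symbol")["Close"]
for i in range(1, 3):
    df["Perf{}".format(i)] = grouped.pct_change(i)

print(df)
```

This replaces the per-ticker loop from the question with one vectorized pass per lag, which is what brings the runtime down from tens of minutes to well under a second on ~400k rows.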