Calculating ratio of consecutive values in dataframe in R

I have a dataframe with 5-second intraday data for a stock. The dataframe consists of a column for the date, one for the time, and one for the price at that moment. I want to add a new column that holds the ratio of two consecutive price values. I tried it with a for loop, which works but is really slow.

data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
  if(is.na(data$price[i])== TRUE){
    data$ratio[i] <- 0
  } else {
    data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1) 
  }
}

I was wondering if there is a faster option, since my dataset contains more than 500,000 rows. I have already tried something with ddply:

data["ratio"]<- 0
fun <- function(x){
  data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)

and mutate:

data<- mutate(data, (ratio =((price/lag(price))-1)))

but neither works and I don't know how to fix it. Hopefully somebody can help me with this!

You can use the lag function (the dplyr version, which shifts values and is what the output here reflects) to shift your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in the lag function has to be positive, so the -1 in your ddply attempt may be causing an error when you run your code.

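library(dplyr)  # lag() here is dplyr::lag, not stats::lag

# Create some fake data
set.seed(5)  # For reproducibility
dat = data.frame(x=rnorm(10))

# Ratio of each value to the previous one; row 1 has no predecessor, so it is NA
dat$ratio = dat$x/lag(dat$x,1)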

dat
             x       ratio
1  -0.84085548          NA
2   1.38435934 -1.64637013
3  -1.25549186 -0.90691183
4   0.07014277 -0.05586875
5   1.71144087 24.39939227
6  -0.60290798 -0.35228093
7  -0.47216639  0.78314834
8  -0.63537131  1.34565131
9  -0.28577363  0.44977422
10  0.13810822 -0.48327840
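One caveat worth checking if the result looks wrong: when dplyr is not attached (or its lag is masked), lag() resolves to stats::lag(). To my understanding, that function leaves the values of a plain vector in place and only attaches a time-series attribute, so nothing is actually shifted. A minimal base-R sketch of the same idea follows; shift1 is a hypothetical helper written for this illustration, not a function from any package.

```r
# shift1 is a hypothetical helper (not from any package): it moves every
# value down one position and pads the front with NA, like dplyr::lag(x, 1).
shift1 <- function(x) c(NA, head(x, -1))

x <- c(10, 20, 30, 60)

# stats::lag() on a plain vector keeps the values where they are
unshifted <- as.vector(stats::lag(x, 1))

# Ratio of each value to its predecessor, using the explicit shift
ratio <- x / shift1(x)
```

Comparing unshifted with x is a quick way to see whether the lag in scope actually shifts anything.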

A for loop in R can be extremely slow. Try to avoid it if you can.

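datalen=length(data$price)

# Each price divided by the previous one, minus 1 as in the question's loop.
# Parentheses matter: 1:datalen-1 parses as (1:datalen)-1, not 1:(datalen-1).
data$ratio[2:datalen]=data$price[2:datalen]/data$price[1:(datalen-1)] - 1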

You don't need to do the is.na check: you will get NA in the result if either the numerator or the denominator is NA.
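If you do want the same convention as the loop in the question (0 in the first row and wherever the current price is NA), that can be applied afterwards in one vectorized step. The following is only a sketch on made-up prices; the vector names are chosen for illustration.

```r
# Made-up prices with one missing value
price <- c(100, 101, NA, 103, 104)
n <- length(price)

# Relative change versus the previous price; the first row has no
# predecessor, so it gets 0 as in the question's loop
ratio <- c(0, price[2:n] / price[1:(n - 1)] - 1)

# The loop writes 0 only where the current price is NA; a row whose
# previous price is NA stays NA
ratio[is.na(price)] <- 0
```

Here row 3 becomes 0 because its own price is missing, while row 4 stays NA because its predecessor is missing, which is what the original loop would produce as well.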
