
How to use Rcpp to avoid using for loops in R

I have data in xts format (data) that looks like this:

                              A
2008-01-14 09:29:59           10 
2008-01-14 09:29:59           0.1
2008-01-14 09:30:00           0.9
2008-01-14 09:30:00           0.1
2008-01-14 09:30:00           0.2
2008-01-14 09:30:00           0.4
2008-01-14 09:30:00           0.6
2008-01-14 09:30:00           0.7
2008-01-14 09:30:02           1.5
2008-01-14 09:30:06           0.1
2008-01-14 09:30:06           0.1
2008-01-14 09:30:07           0.9
2008-01-14 09:30:07           0.2
2008-01-14 09:30:10           0.4
2008-01-14 09:30:10           0.3
2008-01-14 09:30:25           1.5 

There is no pattern in any column or row element.

The data is indexed by a POSIXct object. I am creating new columns called '1second' and '3second'. For column '1second', for each row, I want to find the next observation within the next 1 second according to the xts time index and record that observation's value (the column labelled 'A' in the sample above and 'B' in the expected output and code below). If there is no observation within the next second, put NA in data$1second for that row.

Similarly, for column '3second', for each row, I want to find the leading observation within the next 3 seconds according to the xts time index. If several rows within the next 3 seconds share the same timestamp, use only the last observation.

If there is no observation within the next 3 seconds, put NA in data$3second for that row. For example, I expect the following results:

                              B    1second  3second
2008-01-14 09:29:59           10    0.7      1.5        
2008-01-14 09:29:59           0.1   0.7      1.5
2008-01-14 09:30:00           0.9   NA       1.5
2008-01-14 09:30:00           0.1   NA       1.5
2008-01-14 09:30:00           0.2   NA       1.5
2008-01-14 09:30:00           0.4   NA       1.5
2008-01-14 09:30:00           0.6   NA       1.5
2008-01-14 09:30:00           0.7   NA       1.5
2008-01-14 09:30:02           1.5   NA       NA
2008-01-14 09:30:06           0.1   0.2      0.2
2008-01-14 09:30:06           0.1   0.2      0.2
2008-01-14 09:30:07           0.9   NA       0.3
2008-01-14 09:30:07           0.2   NA       0.3
2008-01-14 09:30:10           0.4   NA       0.3
2008-01-14 09:30:10           0.3   NA       NA
2008-01-14 09:30:25           1.5   NA       NA
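
The rule described above can be pinned down with a small reference function. The following is a minimal sketch in plain C++ (no Rcpp), under these assumptions: timestamps are sorted ascending and stored as seconds, NaN stands in for R's NA, and the name lead_within is hypothetical. It reads "within the next h seconds" as the interval (t, t + h] and lets the last matching row win.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference sketch of the rule in the question (not the asker's code):
// for row i, return the value of the LAST row whose timestamp lies in
// (ts[i], ts[i] + horizon]; NaN plays the role of NA.
// Assumes ts is sorted ascending and expressed in seconds.
std::vector<double> lead_within(const std::vector<double>& ts,
                                const std::vector<double>& val,
                                double horizon) {
  assert(ts.size() == val.size());
  const double na = std::nan("");
  std::vector<double> out(ts.size(), na);
  for (std::size_t i = 0; i < ts.size(); ++i) {
    for (std::size_t j = i + 1; j < ts.size(); ++j) {
      if (ts[j] > ts[i] + horizon) break;  // sorted input: nothing later can match
      if (ts[j] > ts[i]) out[i] = val[j];  // keep overwriting, so the last match wins
    }
  }
  return out;
}
```

With the sample data encoded as seconds past 09:29:00 (59, 59, 60, ..., 85), this reproduces the expected columns except for the first 09:30:10 row, where it gives NA for the 3-second horizon.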

Here is my current code. It works, but it is very slow.

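TimeStmp is the POSIXct index of the data.

      TimeHorizon <- c(1, 3)
      for (j in seq_len(nrow(data))) {
        # rows whose timestamp is exactly x seconds after row j, one entry per horizon
        # (lapply always returns a list, whereas sapply may simplify to a matrix)
        a <- lapply(TimeHorizon, function(x) which(TimeStmp == TimeStmp[j] + x))
        for (k in seq_along(a)) {
          if (length(a[[k]]) > 0) {
            data[j, k + 1] <- (data$B)[last(a[[k]])]
          }
        }
      }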

I am wondering whether it is possible to use Rcpp to avoid the for loop. Thank you so much for the help.

I'm not entirely happy with this code, but it is one possible approach (shift() here comes from the data.table package):

temp1 <- test[!duplicated(test$timestamp, fromLast = TRUE), ]
for (i in c(0, rep(1, 3))) {
  temp1$timestamp <- temp1$timestamp - i
  test <- merge(test, temp1, by = "timestamp", all.x = TRUE)
}
colnames(test) <- c("timestamp", "B", "0second", "1second", "2second", "3second")
test$`3second` <- test[-1][cbind(1:nrow(test), max.col(!is.na(test[-1]), "last"))]
test$`3second`[shift(test$timestamp,1,type = "lead") - test$timestamp > 3 | is.na(shift(test$timestamp,1,type = "lead") - test$timestamp)] <- NA
test <- test[c("timestamp", "B", "1second", "3second")]
test
#              timestamp    B 1second 3second
# 1  2008-01-14 09:29:59  0.1     0.7     1.5
# 2  2008-01-14 09:29:59 10.0     0.7     1.5
# 3  2008-01-14 09:30:00  0.9      NA     1.5
# 4  2008-01-14 09:30:00  0.1      NA     1.5
# 5  2008-01-14 09:30:00  0.2      NA     1.5
# 6  2008-01-14 09:30:00  0.4      NA     1.5
# 7  2008-01-14 09:30:00  0.6      NA     1.5
# 8  2008-01-14 09:30:00  0.7      NA     1.5
# 9  2008-01-14 09:30:02  1.5      NA      NA
# 10 2008-01-14 09:30:06  0.1     0.2     0.2
# 11 2008-01-14 09:30:06  0.1     0.2     0.2
# 12 2008-01-14 09:30:07  0.9      NA     0.3
# 13 2008-01-14 09:30:07  0.2      NA     0.3
# 14 2008-01-14 09:30:10  0.3      NA     0.3
# 15 2008-01-14 09:30:10  0.4      NA      NA
# 16 2008-01-14 09:30:25  1.5      NA      NA

EDIT: Just saw that you want to use Rcpp. Well then, just ignore this answer. :)

EDIT2: An explanation of my code, with apologies if it is not the clearest. Instead of looping over the column, one first takes the last observation for each timestamp (line 1). This is then left-joined onto the original data frame. Next, one second is subtracted from the timestamp and the result is left-joined onto the original data frame again. This is repeated 3 times to account for delays of 1, 2, and 3 seconds (lines 2-5). The data frame now contains the "correct" element in the same row; it is only a question of finding the correct column. The correct column is the last one that is not NA for that row (line 7). The rows that have no follow-up observation within the next three seconds still need to be set to NA (line 8). After that, the unnecessary columns can be removed (line 9) and it's done.
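
The join idea in this explanation can also be expressed with a hash map instead of repeated merge() calls. The sketch below is a loose variant in plain C++, not a translation of the R code above: it assumes timestamps are whole seconds stored as integers, uses NaN in place of NA, skips the zero-second offset, and the name lead_by_lookup is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: the "last observation per timestamp, then look it up
// at shifted timestamps" idea, done with a hash map.
// Assumes whole-second integer timestamps; rows sharing a timestamp must
// appear in their original order so that "last" is meaningful.
std::vector<double> lead_by_lookup(const std::vector<long long>& ts,
                                   const std::vector<double>& val,
                                   int horizon) {
  assert(ts.size() == val.size());
  // Step 1: last observation per timestamp (later rows overwrite earlier ones).
  std::unordered_map<long long, double> last_at;
  for (std::size_t i = 0; i < ts.size(); ++i) last_at[ts[i]] = val[i];

  // Step 2: probe ts + horizon, ts + horizon - 1, ..., ts + 1 and take the
  // first hit, i.e. the largest offset that has an observation.
  std::vector<double> out(ts.size(), std::nan(""));
  for (std::size_t i = 0; i < ts.size(); ++i) {
    for (int d = horizon; d >= 1; --d) {
      auto it = last_at.find(ts[i] + d);
      if (it != last_at.end()) {
        out[i] = it->second;
        break;
      }
    }
  }
  return out;
}
```

The work per row is proportional to the horizon, so this only makes sense for small integer horizons such as 1 and 3; for fractional timestamps a sorted scan is the better fit.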

If you want an Rcpp solution, you can use:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector name_me(List df, double nsec) {

  NumericVector TimeStmp = df["TimeStmp"];
  NumericVector B        = df["B"];
  int n = B.size();
  int i, j, k, ndup;
  double time;

  NumericVector res(n);

  for (i = 0; i < n; i++) {

    // get last for same second
    for (ndup = 0; (i+1) < n; i++, ndup++) {
      if (TimeStmp[i+1] != TimeStmp[i]) break;
    }

    // get last value within nsec
    time = TimeStmp[i] + nsec;
    for (j = i+1; j < n; j++) {
      if (TimeStmp[j] > time) break;
    }

    // fill all previous ones with same value
    res[i] = (j == (i+1)) ? NA_REAL : B[j-1];
    for (k = 1; k <= ndup; k++) res[i-k] = res[i];
  }

  return res;
}

Then, after sourcing this .cpp file (for example with Rcpp::sourceCpp()), you just need to call:

name_me(df, 1)
name_me(df, 3)

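Note that there is an inconsistency in the (n-2)th row of your expected output for the 3-second column: the two 09:30:10 rows share a timestamp and the next observation is 15 seconds later, so both should be NA.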
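
In name_me the scan over j restarts at i+1 for every group of equal timestamps. Because the timestamps are sorted, the end of the window can instead only move forward, which keeps the whole pass linear even for large horizons. Below is a sketch of that two-pointer variant in plain C++, assuming sorted timestamps and nsec >= 0; std::vector replaces NumericVector, NaN replaces NA_REAL, and the name lead_two_pointer is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-pointer sketch: j always points at the first row whose timestamp is
// beyond ts[i] + nsec (or at n), and never moves backwards.
// Assumes ts is sorted ascending and nsec >= 0.
std::vector<double> lead_two_pointer(const std::vector<double>& ts,
                                     const std::vector<double>& val,
                                     double nsec) {
  assert(ts.size() == val.size());
  const std::size_t n = ts.size();
  std::vector<double> out(n, std::nan(""));
  std::size_t j = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (j < i + 1) j = i + 1;                    // window end is never behind i
    while (j < n && ts[j] <= ts[i] + nsec) ++j;  // advance to first row past the window
    // Row j-1 is the last row inside the window; it only counts if it is
    // strictly later than row i (rows sharing ts[i] are not "next").
    if (j - 1 > i && ts[j - 1] > ts[i]) out[i] = val[j - 1];
  }
  return out;
}
```

Each of i and j advances at most n times, so the cost is O(n) regardless of how many rows fall inside each window.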
