For loop in R takes forever to run

    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
1     ppl_100 act2_1734928       0        1            0           0       1                0                 NA
2     ppl_100 act2_2434093       0        1            0           0       2                0                  0
3     ppl_100 act2_3404049       0        1            0           0       3                0                  0
4     ppl_100 act2_3651215       0        1            0           0       4                0                  0
5     ppl_100 act2_4109017       0        1            0           0       5                0                  0
6     ppl_100  act2_898576       0        1            0           0       6                0                  0
7  ppl_100002 act2_1233489       1        1            1           1       1                1                  1
8  ppl_100002 act2_1623405       1        1            1           2       2                1                  0
9  ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0

I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that I want it computed for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:

success_rate_trend<-vector(mode="numeric", length=nrow(succ_rate_df)-1)
for(i in 2:nrow(succ_rate_df)){
  if(succ_rate_df[i,1]!=succ_rate_df[i-1,1]){
    success_rate_trend[i] = NA
  } else {
    success_rate_trend[i]<-succ_rate_df[i,8]-succ_rate_df[i-1,8]
  }
}

It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?

Use vectorization:

success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_

Note:

  1. people_id is a factor variable (fctr). To use diff() we must first remove the factor class with as.integer() or unclass().
  2. You do not have an ordinary data frame but a tbl_df from dplyr. Matrix-like indexing does not behave as it does for a data frame: succ_rate_df[, 1] returns a one-column tbl_df, not a vector. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead.
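One thing to watch with the two lines above is that diff() returns one value fewer than there are rows. A minimal sketch of how the result could be padded and stored as a column, assuming the rows are already sorted so that each person's activities sit together; dat here is a made-up copy of the first ten sample rows, not the real succ_rate_df:

```r
# Toy data copied from the sample in the question (first 10 rows)
dat <- data.frame(
  people_id = factor(c(rep("ppl_100", 6), rep("ppl_100002", 2), rep("ppl_100003", 2))),
  cum_success_rate = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Row-to-row change in the cumulative rate (length n - 1)
trend <- diff(dat$cum_success_rate)

# Blank out differences that straddle two different people
trend[diff(as.integer(dat$people_id)) != 0] <- NA_real_

# Prepend NA so there is one value per row and it can be stored as a column
dat$success_rate_trend <- c(NA_real_, trend)
dat
```

The leading NA marks the first activity of the first person; the NAs set inside the vector mark the first activity of every later person.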

You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.

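n = nrow(succ_rate_df)
# Pull the columns out as plain vectors (column 1 and column 8 of the data frame)
id = succ_rate_df$people_id
rate = succ_rate_df$cum_success_rate

# TRUE where row i+1 belongs to the same person as row i
same_person = id[2:n] == id[1:(n-1)]
is_true = which(same_person)

# Start from NA everywhere, then fill in the within-person differences
success_rate = rep(NA_real_, n-1)
success_rate[is_true] = rate[is_true+1] - rate[is_true]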

The answer by Zheyuan Li is neater.
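If you want to check the speed claim on your own machine, something like the following could be used. It is only a sketch on simulated data: the sizes, the seed, and the helper names loop_trend and vec_trend are made up for illustration, and the loop here works on plain vectors, so it is likely faster than the loop in the question, which indexes data frame cells on every iteration.

```r
set.seed(1)  # simulated data; sizes and values are made up for illustration
n <- 20000
sim <- data.frame(
  people_id = factor(sort(sample(2000, n, replace = TRUE))),
  cum_success_rate = runif(n)
)

# Loop version, written against plain vectors instead of data frame cells
loop_trend <- function(id, rate) {
  out <- rep(NA_real_, length(rate))
  for (i in 2:length(rate)) {
    if (id[i] == id[i - 1]) out[i] <- rate[i] - rate[i - 1]
  }
  out
}

# Vectorised version, padded with a leading NA to match the loop's length
vec_trend <- function(id, rate) {
  out <- diff(rate)
  out[diff(as.integer(id)) != 0] <- NA_real_
  c(NA_real_, out)
}

t_loop <- system.time(res_loop <- loop_trend(sim$people_id, sim$cum_success_rate))
t_vec  <- system.time(res_vec  <- vec_trend(sim$people_id, sim$cum_success_rate))

all.equal(res_loop, res_vec)  # checks that both approaches agree
c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]])
```

Timings will vary by machine and R version, so treat the numbers as indicative only.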

I'm going to offer an answer based on a data frame version of this data. You should learn to post the output of dput so that objects with special properties, like the tibble you printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to groups defined by grouping vector(s). I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are worked examples on SO that use two grouping factors with ave.

 success_rate_trend <- with( dat, 
                    ave( cum_success_rate, people_id, FUN= function(x) c(NA, diff(x) ) ) )

 success_rate_trend
 [1] NA  0  0  0  0  0 NA  0 NA  0
 # not a very interesting result
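On the dput point, here is a rough sketch of what posting with dput looks like and why it helps. The three-row dat below is made up for illustration, and the round trip through capture.output() and parse() only mimics what a reader does when pasting the printed text into their console:

```r
# Small made-up data frame standing in for the real one
dat <- data.frame(
  people_id = factor(c("ppl_100", "ppl_100", "ppl_100002")),
  cum_success_rate = c(0, 0, 1)
)

# dput() prints R code that rebuilds the object, including the factor levels
dput(dat)

# Round trip: capture that text and evaluate it, as a reader pasting it would
txt <- paste(capture.output(dput(dat)), collapse = "\n")
dat_copy <- eval(parse(text = txt))

# Compare the rebuilt object with the original
all.equal(dat, dat_copy)
str(dat_copy)
```

For a large data frame you would normally post only a slice, for example dput(head(dat, 10)).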
