R Programming: How to vectorize/speed up a for loop that needs the previous value in each iteration

I'm using a for loop to fill a vector. The problem is that each iteration needs the previous value to continue the calculation.

I'm using the data.table package, so the object is a data table. R version: 3.2.3, 64-bit.

The for loop works, but it takes a long time to run, and I would like to know whether there is a way to vectorize it or otherwise make it faster. I will explain what I'm trying to achieve. I use a loop for this part because each step needs the previous value, so I have not been able to vectorize the operation.

The data table has the following structure:

NUMDCRED         FDES         Distancia      CURA   NPV
 0001        "2012-01-01"        11            0     1
 0001        "2012-02-01"        12            0     2
 0001        "2012-03-01"        13            1     2
 0001        "2011-01-01"        14            1     3
 0001        "2011-02-01"        15            1     3
 0001        "2011-03-01"        16            1     2 
 0001        "2011-04-01"        10            0     5
 0001        "2011-05-01"        11            0     4
 0001        "2011-06-01"        12            0     6 
 0001        "2011-07-01"        13            1     3
 0001        "2011-08-01"        14            1     2
 0001        "2011-09-01"        15            1     2
 0001        "2011-10-01"        16            1     1
 0001        "2011-11-01"        17            1     3
 0002        "2012-04-01"        11            0     6
 0002        "2012-05-01"        12            0     5
 0002        "2012-06-01"        13            1     4
 0002        "2012-07-01"        14            1     3
 0002        "2012-08-01"        15            1     3
 0002        "2012-09-01"        16            1     3
 0002        "2012-10-01"        10            0     3
 0002        "2012-11-01"        11            0     4
 0002        "2012-12-01"        12            0     4
 0002        "2013-01-01"        13            1     2
 0002        "2013-02-01"        14            1     2
 0002        "2013-03-01"        15            1     3
 0002        "2013-04-01"        16            1     3

The table (POBLACION_MOROSA6) is sorted by NUMDCRED and FDES in ascending order. I need to create another variable called P.Moroso. Its value is set to 1 on the first row of each distinct NUMDCRED, and it increases to P.Moroso + 1 whenever the condition NPV < 4 and Distancia > 12 and Cura[i-1] != 1 is met. Between those points the value is carried forward unchanged: when a new NUMDCRED appears, P.Moroso is 1 and stays 1 on the following rows until the condition is met, at which point it becomes 2 and keeps that value on each row until the condition is met again, and so on.

The output of the process would be the following:

NUMDCRED         FDES         Distancia      CURA   NPV  P.Moroso
 0001        "2012-01-01"        11            0     1      1
 0001        "2012-02-01"        12            0     2      1
 0001        "2012-03-01"        13            1     2      2
 0001        "2011-01-01"        14            1     3      2
 0001        "2011-02-01"        15            1     3      2
 0001        "2011-03-01"        16            1     2      2
 0001        "2011-04-01"        10            0     5      2
 0001        "2011-05-01"        11            0     4      2
 0001        "2011-06-01"        12            0     6      2
 0001        "2011-07-01"        13            1     3      3
 0001        "2011-08-01"        14            1     2      3
 0001        "2011-09-01"        15            1     2      3
 0001        "2011-10-01"        16            1     1      3
 0001        "2011-11-01"        17            1     3      3
 0002        "2012-04-01"        11            0     6      1
 0002        "2012-05-01"        12            0     5      1
 0002        "2012-06-01"        13            1     4      2
 0002        "2012-07-01"        14            1     3      2
 0002        "2012-08-01"        15            1     3      2
 0002        "2012-09-01"        16            1     3      2
 0002        "2012-10-01"        10            0     3      2
 0002        "2012-11-01"        11            0     4      2
 0002        "2012-12-01"        12            0     4      2
 0002        "2013-01-01"        13            1     2      3
 0002        "2013-02-01"        14            1     2      3
 0002        "2013-03-01"        15            1     3      3
 0002        "2013-04-01"        16            1     3      3  

For the moment I'm using the following simple for loop to do this:

PERIODO_MOROSO <- vector(mode = "numeric",length=N3)
isFirstNumdCred_Morosa6 <- (1:N3) %in% FIRST_NUMDCRED_INDEX_P.MOROSA6

for(i in 1:N3){ 

   if(isFirstNumdCred_Morosa6[i]){

      P.MOROSO <- 1
   } else if(POBLACION_MOROSA6[i,NPV] < 4 & POBLACION_MOROSA6[i-1,CURA] != 1 & POBLACION_MOROSA6[i,DISTANCIA_SALIDA] > 12){

     P.MOROSO <- P.MOROSO + 1
   }

   PERIODO_MOROSO[i] <- P.MOROSO
}

POBLACION_MOROSA6$P.MOROSO <- PERIODO_MOROSO 

The variable isFirstNumdCred_Morosa6 is a logical vector that marks the first row of each distinct NUMDCRED. My problem with the for loop is that it is slow on large data (my tables have between 900k and 2 million rows). I tried using something like

ex[,date.seq.3:=ifelse( condition, shift(P.Moroso) +1 , P.Moroso)]

but it didn't work (I first assigned 1 to the rows with the first distinct NUMDCRED).
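One likely reason the ifelse/shift attempt falls short: the whole right-hand side is evaluated against the column as it stood before the assignment, so each row only sees the old value one row back and an increment never carries forward. A running counter that goes up by one on flagged rows is the same thing as a cumulative sum of the flags. A minimal sketch on a made-up flag vector (all names here are hypothetical, not from the original table):

```r
# Toy flags: TRUE marks a row where the counter should go up by one.
flag <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Sequential version: each value depends on the previous one.
counter_loop <- integer(length(flag))
current <- 1L
for (i in seq_along(flag)) {
  if (flag[i]) current <- current + 1L
  counter_loop[i] <- current
}

# Vectorized version: running count of TRUEs, offset by the starting value 1.
counter_vec <- cumsum(flag) + 1L

# Shift-style update: looks one row back at the OLD values only,
# so the increment is not carried to the following rows.
old <- rep(1L, length(flag))
counter_shift <- ifelse(flag, c(NA, head(old, -1)) + 1L, old)

print(counter_loop)   # 1 1 2 2 3 3
print(counter_vec)    # 1 1 2 2 3 3
print(counter_shift)  # 1 1 2 1 2 1
```

The loop and the cumsum version agree, while the shift-style version drops back to 1 right after each flagged row.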

I also tried other methods that people suggested in a question I posted before, but I couldn't get them to work. Here is the link to that question, in case anyone wants to see the solution to a similar problem I had: R programming: How to speed up a loop that takes 2 hours and the reasons why it takes a lot

So, in conclusion, I would like to know whether it is possible to vectorize/speed up this process.

You do not need loops.

ex <- read.table(header = TRUE, text = 'NUMDCRED         FDES         Distancia      CURA   NPV  P.Moroso
 0001        "2012-01-01"        11            0     1      1
                 0001        "2012-02-01"        12            0     2      1
                 0001        "2012-03-01"        13            1     2      2
                 0001        "2011-01-01"        14            1     3      2
                 0001        "2011-02-01"        15            1     3      2
                 0001        "2011-03-01"        16            1     2      2
                 0001        "2011-04-01"        10            0     5      2
                 0001        "2011-05-01"        11            0     4      2
                 0001        "2011-06-01"        12            0     6      2
                 0001        "2011-07-01"        13            1     3      3
                 0001        "2011-08-01"        14            1     2      3
                 0001        "2011-09-01"        15            1     2      3
                 0001        "2011-10-01"        16            1     1      3
                 0001        "2011-11-01"        17            1     3      3
                 0002        "2012-04-01"        11            0     6      1
                 0002        "2012-05-01"        12            0     5      1
                 0002        "2012-06-01"        13            1     4      2
                 0002        "2012-07-01"        14            1     3      2
                 0002        "2012-08-01"        15            1     3      2
                 0002        "2012-09-01"        16            1     3      2
                 0002        "2012-10-01"        10            0     3      2
                 0002        "2012-11-01"        11            0     4      2
                 0002        "2012-12-01"        12            0     4      2
                 0002        "2013-01-01"        13            1     2      3
                 0002        "2013-02-01"        14            1     2      3
                 0002        "2013-03-01"        15            1     3      3
                 0002        "2013-04-01"        16            1     3      3  ')

In base R, you can write your logic into a function

f <- function(data)
  cumsum(with(data, Distancia > 12 & NPV <= 4 & c(0, CURA[-length(CURA)]) != 1)) + 1L

and apply it to subsets of the data

ex$P.Moroso2 <- unlist(by(ex, ex$NUMDCRED, f))

identical(ex$P.Moroso, ex$P.Moroso2)
# [1] TRUE

Translated to data.table, this would look like

setDT(ex)[, P.Moroso3 := 
  cumsum(Distancia > 12 & NPV <= 4 & shift(CURA, fill = 0) != 1) + 1L
, by = NUMDCRED]
# or Frank says this works, anyways
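Note that the two versions above differ from the loop in the question in two small ways. They test NPV <= 4 where the question's text says NPV < 4 (the expected output appears to need <=, since the third row of NUMDCRED 0002 has NPV 4 and still increments), and they allow the first row of a group to increment, whereas the loop always pins it to 1. If the loop's first-row behaviour matters, a base R sketch along these lines might reproduce it. The function name p_moroso_strict, its arguments, and the toy data are hypothetical, and it assumes that rows with the same id are contiguous:

```r
# Hypothetical helper: grouped running counter that mimics the loop's rules.
# Assumes rows with the same id are contiguous (table sorted by id).
p_moroso_strict <- function(id, distancia, npv, cura, strict_npv = FALSE) {
  is_first  <- !duplicated(id)             # first row of each id
  prev_cura <- c(0, head(cura, -1))        # CURA of the previous row
  npv_ok    <- if (strict_npv) npv < 4 else npv <= 4
  # A step happens only on non-first rows that meet the condition.
  step <- distancia > 12 & npv_ok & prev_cura != 1 & !is_first
  # Running count of steps within each id, starting from 1.
  ave(as.integer(step), id, FUN = cumsum) + 1L
}

# Toy data (made up): the first row of id "b" meets the condition
# but must still start at 1.
toy <- data.frame(
  id        = c("a", "a", "a", "b", "b", "b"),
  Distancia = c(11, 13, 14, 13, 10, 15),
  NPV       = c(1, 2, 3, 2, 5, 4),
  CURA      = c(0, 0, 1, 0, 0, 1)
)
toy$P.Moroso <- with(toy, p_moroso_strict(id, Distancia, NPV, CURA))
toy$P.Moroso.strict <- with(
  toy, p_moroso_strict(id, Distancia, NPV, CURA, strict_npv = TRUE)
)
print(toy)
```

Because first rows never count as a step, the previous-row CURA leaking across a group boundary has no effect on the result.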

You mean something like this? (Suppose your table name is "TABLA".)

NUMDCRED = TABLA$NUMDCRED
Cura = TABLA$CURA
NPV = TABLA$NPV
Distancia = TABLA$Distancia   # plain vectors holding the columns the loop needs

N = length(NUMDCRED)
P.moroso = numeric(N)  # preallocate the result vector
contador = 1           # the counter starts at 1
P.moroso[1] = contador
for (i in 2:N){
    if (NUMDCRED[i-1] != NUMDCRED[i])
       contador = 1  # new NUMDCRED: reset the counter to 1
    else if ((NPV[i] < 4) && (Distancia[i] > 12) && (Cura[i-1] != 1))
       contador = contador + 1  # condition met: increase the counter by 1
    P.moroso[i] = contador  # store the counter for row i
}

Now you should have a P.moroso vector with the numbers you want. Finally, attach it to your table:

TABLA$P.moroso = P.moroso

I think I have a fast solution, but I haven't tested it, so I can't be sure. Here is my thought process:

  1. First, split the data by the value of NUMDCRED, since P.Moroso always restarts at 1 each time NUMDCRED changes. Put each subset of the data into a list.

  2. Apply a function to each data set in the list using lapply. In that function, first create a column that is TRUE if the condition you specified is satisfied and FALSE if it is not. Then take the cumulative sum of this column (adding 1, so that each group starts at 1) and store it as your P.Moroso column. I think that should be what you want.

  3. Merge all of the data sets back together.
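The three steps above could be sketched in base R roughly as follows. This is an untested-at-scale illustration on made-up data, using the NPV < 4 condition as worded in the question and treating the missing previous CURA on the first row of each group as 0; the names toy, pieces, and result are hypothetical.

```r
# Toy data with the same column names as the question's table (values made up).
toy <- data.frame(
  NUMDCRED  = c(1, 1, 1, 2, 2),
  Distancia = c(11, 13, 14, 12, 13),
  CURA      = c(0, 0, 1, 0, 0),
  NPV       = c(1, 2, 3, 5, 3)
)

# Step 1: one data frame per NUMDCRED.
pieces <- split(toy, toy$NUMDCRED)

# Step 2: flag the rows that meet the condition, then take a running sum.
# The + 1L makes each group start at 1 rather than 0.
pieces <- lapply(pieces, function(d) {
  prev_cura <- c(0, head(d$CURA, -1))  # no previous row: treat CURA as 0
  met <- d$Distancia > 12 & d$NPV < 4 & prev_cura != 1
  d$P.Moroso <- cumsum(met) + 1L
  d
})

# Step 3: put the pieces back together in the original row order.
result <- unsplit(pieces, toy$NUMDCRED)
print(result)
```

unsplit() is used for the last step because it restores the original row order; rbind over the list would also work when the table is already sorted by NUMDCRED.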
