R Programming: How to vectorize/speed up a for loop that needs the previous value in each iteration
I'm using a for loop to fill a vector. The problem is that each iteration needs the previous value to continue the calculation.
I'm using the data.table package, so the data is a data.table. R version: 3.2.3, 64-bit.
The for loop works, but it takes a long time to run, so I would like to know if there is a way to vectorize this process or make it faster. I will explain what I'm trying to achieve. I use a loop for this part because each row needs the previous value, so I could not see how to vectorize the operation.
The data table has the following structure:
NUMDCRED FDES Distancia CURA NPV
0001 "2012-01-01" 11 0 1
0001 "2012-02-01" 12 0 2
0001 "2012-03-01" 13 1 2
0001 "2011-01-01" 14 1 3
0001 "2011-02-01" 15 1 3
0001 "2011-03-01" 16 1 2
0001 "2011-04-01" 10 0 5
0001 "2011-05-01" 11 0 4
0001 "2011-06-01" 12 0 6
0001 "2011-07-01" 13 1 3
0001 "2011-08-01" 14 1 2
0001 "2011-09-01" 15 1 2
0001 "2011-10-01" 16 1 1
0001 "2011-11-01" 17 1 3
0002 "2012-04-01" 11 0 6
0002 "2012-05-01" 12 0 5
0002 "2012-06-01" 13 1 4
0002 "2012-07-01" 14 1 3
0002 "2012-08-01" 15 1 3
0002 "2012-09-01" 16 1 3
0002 "2012-10-01" 10 0 3
0002 "2012-11-01" 11 0 4
0002 "2012-12-01" 12 0 4
0002 "2013-01-01" 13 1 2
0002 "2013-02-01" 14 1 2
0002 "2013-03-01" 15 1 3
0002 "2013-04-01" 16 1 3
The table (POBLACION_MOROSA6) is sorted by NUMDCRED and FDES in ascending order. What I need to do is create another variable called P.Moroso. Its value is set to 1 on the first row of each distinct NUMDCRED and increases to P.Moroso + 1 whenever the condition NPV < 4 and Distancia > 12 and CURA[i-1] != 1 is met. The value of P.Moroso is carried forward on every record until the condition is met again. In other words, when a NUMDCRED first appears P.Moroso is 1, and it stays 1 on the following records until the condition is met, when it changes to P.Moroso + 1 (that is, 2). That value is then kept on each record until the next change, and so on.
The output of the process would be the following:
NUMDCRED FDES Distancia CURA NPV P.Moroso
0001 "2012-01-01" 11 0 1 1
0001 "2012-02-01" 12 0 2 1
0001 "2012-03-01" 13 1 2 2
0001 "2011-01-01" 14 1 3 2
0001 "2011-02-01" 15 1 3 2
0001 "2011-03-01" 16 1 2 2
0001 "2011-04-01" 10 0 5 2
0001 "2011-05-01" 11 0 4 2
0001 "2011-06-01" 12 0 6 2
0001 "2011-07-01" 13 1 3 3
0001 "2011-08-01" 14 1 2 3
0001 "2011-09-01" 15 1 2 3
0001 "2011-10-01" 16 1 1 3
0001 "2011-11-01" 17 1 3 3
0002 "2012-04-01" 11 0 6 1
0002 "2012-05-01" 12 0 5 1
0002 "2012-06-01" 13 1 4 2
0002 "2012-07-01" 14 1 3 2
0002 "2012-08-01" 15 1 3 2
0002 "2012-09-01" 16 1 3 2
0002 "2012-10-01" 10 0 3 2
0002 "2012-11-01" 11 0 4 2
0002 "2012-12-01" 12 0 4 2
0002 "2013-01-01" 13 1 2 3
0002 "2013-02-01" 14 1 2 3
0002 "2013-03-01" 15 1 3 3
0002 "2013-04-01" 16 1 3 3
For the moment I'm using the following simple for loop to do this:
PERIODO_MOROSO <- vector(mode = "numeric",length=N3)
isFirstNumdCred_Morosa6 <- (1:N3) %in% FIRST_NUMDCRED_INDEX_P.MOROSA6
for (i in 1:N3) {
  if (isFirstNumdCred_Morosa6[i]) {
    # first row of a new NUMDCRED: restart the counter
    P.MOROSO <- 1
  } else if (POBLACION_MOROSA6[i, NPV] < 4 && POBLACION_MOROSA6[i - 1, CURA] != 1 && POBLACION_MOROSA6[i, DISTANCIA_SALIDA] > 12) {
    # condition met: move on to the next period
    P.MOROSO <- P.MOROSO + 1
  }
  PERIODO_MOROSO[i] <- P.MOROSO
}
POBLACION_MOROSA6$P.MOROSO <- PERIODO_MOROSO
The variable isFirstNumdCred_Morosa6 is a logical vector that indicates where the first row of each distinct NUMDCRED appears. My problem with the for loop is that it is slow when working with large data (my tables have between 900k and 2 million rows). I tried using something like
ex[,date.seq.3:=ifelse( condition, shift(P.Moroso) +1 , P.Moroso)]
but it didn't work (I first assigned 1 to the rows with the first different NUMDCRED).
I also tried other methods that people suggested in a question I posted before, but I couldn't get them to work. I am including the link to that question in case anyone wants to see the solution to a similar problem I had.
So, in conclusion, I would like to know if it is possible to vectorize/speed up this process.
Related question: R programming: How to speed up a loop that takes 2 hours and the reasons why it takes a lot
You do not need loops
ex <- read.table(header = TRUE, text = 'NUMDCRED FDES Distancia CURA NPV P.Moroso
0001 "2012-01-01" 11 0 1 1
0001 "2012-02-01" 12 0 2 1
0001 "2012-03-01" 13 1 2 2
0001 "2011-01-01" 14 1 3 2
0001 "2011-02-01" 15 1 3 2
0001 "2011-03-01" 16 1 2 2
0001 "2011-04-01" 10 0 5 2
0001 "2011-05-01" 11 0 4 2
0001 "2011-06-01" 12 0 6 2
0001 "2011-07-01" 13 1 3 3
0001 "2011-08-01" 14 1 2 3
0001 "2011-09-01" 15 1 2 3
0001 "2011-10-01" 16 1 1 3
0001 "2011-11-01" 17 1 3 3
0002 "2012-04-01" 11 0 6 1
0002 "2012-05-01" 12 0 5 1
0002 "2012-06-01" 13 1 4 2
0002 "2012-07-01" 14 1 3 2
0002 "2012-08-01" 15 1 3 2
0002 "2012-09-01" 16 1 3 2
0002 "2012-10-01" 10 0 3 2
0002 "2012-11-01" 11 0 4 2
0002 "2012-12-01" 12 0 4 2
0002 "2013-01-01" 13 1 2 3
0002 "2013-02-01" 14 1 2 3
0002 "2013-03-01" 15 1 3 3
0002 "2013-04-01" 16 1 3 3 ')
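In base R, you can write your logic as a function. Note the NPV <= 4: the expected output increments on a row where NPV is exactly 4 (NUMDCRED 0002, 2012-06-01), so NPV < 4 would not reproduce it.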
f <- function(data)
cumsum(with(data, Distancia > 12 & NPV <= 4 & c(0, CURA[-length(CURA)]) != 1)) + 1L
and apply it to subsets of the data
# use.names = FALSE drops the names that unlist() would otherwise build from the groups
ex$P.Moroso2 <- unlist(by(ex, ex$NUMDCRED, f), use.names = FALSE)
identical(ex$P.Moroso, ex$P.Moroso2)
# [1] TRUE
Translated to data.table, this would look like
library(data.table)
setDT(ex)[, P.Moroso3 :=
            cumsum(Distancia > 12 & NPV <= 4 & shift(CURA, fill = 0) != 1) + 1L
          , by = NUMDCRED]
# or Frank says this works, anyways
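One way to gain confidence in the cumsum formulation is to compare it with an explicit loop on a small table. The sketch below is only an illustration: the `toy` values are made up, the `<=` on NPV follows the answer above, and `ave()` is used as a base R stand-in for the grouped data.table call. One caveat it does not exercise: the loop always assigns 1 to the first row of a group, while the cumsum form would give 2 there if that first row itself had Distancia > 12 and NPV <= 4.

```r
# Toy data (hypothetical values), columns named as in the question
toy <- data.frame(
  NUMDCRED  = c(1, 1, 1, 1, 2, 2, 2),
  Distancia = c(11, 13, 14, 13, 11, 13, 14),
  CURA      = c(0, 0, 1, 0, 0, 1, 1),
  NPV       = c(1, 2, 3, 3, 6, 4, 3)
)

# Reference result: explicit loop that carries the counter forward
n <- nrow(toy)
loop_result <- integer(n)
counter <- 1L
for (i in seq_len(n)) {
  if (i == 1L || toy$NUMDCRED[i] != toy$NUMDCRED[i - 1L]) {
    counter <- 1L                      # first row of a group
  } else if (toy$Distancia[i] > 12 && toy$NPV[i] <= 4 && toy$CURA[i - 1L] != 1) {
    counter <- counter + 1L            # condition met: next period
  }
  loop_result[i] <- counter
}

# Vectorized result: lag CURA within each group, flag the rows that
# meet the condition, then take a per-group cumulative sum
prev_cura <- ave(toy$CURA, toy$NUMDCRED,
                 FUN = function(x) c(0, x[-length(x)]))
starts <- toy$Distancia > 12 & toy$NPV <= 4 & prev_cura != 1
vec_result <- ave(as.integer(starts), toy$NUMDCRED, FUN = cumsum) + 1L

print(all(loop_result == vec_result))
```

If the first-row case can occur in the real data, forcing `starts` to FALSE on the first row of each group before the cumulative sum should bring the two versions back in line.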
You mean something like this...? (suppose your table name is "TABLA")
# Copy the needed columns into plain vectors
NUMDCRED = TABLA$NUMDCRED
Cura = TABLA$CURA
NPV = TABLA$NPV
Distancia = TABLA$Distancia
N = length(NUMDCRED)
P.moroso = numeric(N)  # preallocate the result vector
contador = 1           # the counter starts at 1
P.moroso[1] = contador
for (i in 2:N) {
  if (NUMDCRED[i-1] != NUMDCRED[i]) {
    contador = 1  # new NUMDCRED: reset the counter to 1
  } else if ((NPV[i] < 4) && (Distancia[i] > 12) && (Cura[i-1] != 1)) {
    contador = contador + 1  # condition met: increase the counter by 1
  }
  P.moroso[i] = contador  # store the counter for row i
}
Now, you should have a P.moroso vector with the numbers you want. Finally, you attach it to your table:
TABLA$P.moroso = P.moroso
I think I have a fast solution, but I haven't tested it, so I don't really know. Here is my thought process:
1. You can first split the data by the value of NUMDCRED, since the value of P.Moroso always starts at 1 each time that NUMDCRED changes. Put each subset of the data into a list.
2. You can now apply a function to each dataset in the list using lapply. First, create a column that is TRUE if the condition that you specified is satisfied and FALSE if it is not. Then, you can take a cumulative sum of this column (plus 1, so that the count starts at 1) and store this as your P.Moroso column. I think that should be what you want.
3. Merge all of the data sets back together.
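The three steps above could be sketched in base R roughly as follows. This is an untested-on-real-data illustration, not the answerer's code: the `toy` table and the helper name `add_period` are made up, and the condition uses NPV <= 4 (the question text says < 4, but its expected output increments at NPV = 4), which is an assumption you may need to change.

```r
# Toy data (hypothetical values), columns named as in the question
toy <- data.frame(
  NUMDCRED  = c("A", "A", "A", "B", "B", "B", "B"),
  Distancia = c(10, 13, 15, 12, 14, 9, 16),
  CURA      = c(0, 0, 1, 1, 0, 0, 0),
  NPV       = c(2, 3, 1, 5, 2, 3, 1)
)

# Hypothetical helper: adds P.Moroso to the rows of one NUMDCRED
add_period <- function(d) {
  prev_cura <- c(0, d$CURA[-nrow(d)])            # CURA of the previous row
  cond <- d$Distancia > 12 & d$NPV <= 4 & prev_cura != 1
  d$P.Moroso <- cumsum(cond) + 1L                # running count, starting at 1
  d
}

pieces <- split(toy, toy$NUMDCRED)               # step 1: one data frame per NUMDCRED
pieces <- lapply(pieces, add_period)             # step 2: flag and cumulative sum
result <- do.call(rbind, pieces)                 # step 3: merge back together
rownames(result) <- NULL

print(result)
```

Splitting a table with many distinct NUMDCRED values creates one small data frame per group, so on millions of rows a grouped data.table call is likely to be faster than this version.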