How to speed up a loop in R

Question

I have a data frame called dataSessions with three columns, "Timestamp", "CookieID" and "Name", and over 1.3 million rows. It has been sorted by CookieID and Timestamp.

I want to create a new column called "Sessions" that contains 1 or 0 according to some criteria.

A row gets a 1 if either of the following holds:

1) The previous CookieID is not the same as the current one
2) More than 30 minutes have passed since the previous entry for the same CookieID

I have tried writing a for loop with an if statement that runs over each row and checks whether the CookieID has appeared before. But this procedure takes a very long time. Is there a quicker and more efficient way to do this?

dataSessions$Test<-lag(dataSessions$CookieID, n = 1)

for (i in 1:length(dataSessions$CookieID)) {
  if(dataSessions$CookieID[i] %in% dataSessions$Test[i]) {
    dataSessions$New[i] <- 0
  } else {
    dataSessions$New[i] <- 1
  }
}

Here is an example of the data, with the SESSIONS column I want to generate:

Timestamp              CookieID     Name     SESSIONS
2015-08-28 15:46:03    223284       A        1
2015-09-19 22:26:50    223223       A        1
2015-09-19 22:27:09    223223       A        0
2015-09-19 22:28:11    223223       A        0
2015-09-20 22:29:14    245458       B        1
2015-09-20 22:30:17    245458       B        0
2015-09-20 23:05:01    245458       B        1
2015-09-20 23:06:15    245458       B        0

As shown, SESSIONS is 1 only when a new CookieID begins, or when the previous entry for the same CookieID is more than 30 minutes old.

Answer 1

There's probably a faster way to do this with data.table, but in the meantime:

dd <- read.csv(header=TRUE,
stringsAsFactors=FALSE,text="
Timestamp,CookieID,Name,SESSIONS
2015-08-28 15:46:03,223284,A,1
2015-09-19 22:26:50,223223,A,1
2015-09-19 22:27:09,223223,A,0
2015-09-19 22:28:11,223223,A,0
2015-09-20 22:29:14,245458,B,1
2015-09-20 22:30:17,245458,B,0
2015-09-20 23:05:01,245458,B,1
2015-09-20 23:06:15,245458,B,0")

dd$Timestamp <- as.POSIXct(dd$Timestamp)

Find the time difference (in seconds, converted to half-hours), setting the time between the first observation and its "previous" one to infinity:

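# force seconds: diff() on POSIXct otherwise picks its units automatically
dt <- c(Inf, as.numeric(diff(dd$Timestamp), units = "secs") / (60*30))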

Find the cookie difference:

dcookie <- c(NA,diff(dd$CookieID))

Check for either case:

dd$SESSIONS <- as.numeric(dcookie!=0 | dt >1)

The logic here is that we are looking for cases where

dcookie!=0 : the difference between the previous and current (numeric) cookie values is not zero (i.e., the cookie has changed)
dt>1 : the difference between the previous and current timestamps is more than one half-hour
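The two conditions above can be bundled into a small helper. This is only a sketch: the name flag_sessions and its gap_secs parameter are made up for illustration, the timestamps are assumed to be POSIXct, and the rows are assumed to be sorted by cookie and then by time. It compares neighbouring cookie values directly rather than subtracting them, so it should also work for non-numeric IDs.

```r
# Hypothetical helper (not from the original answer): flags session starts.
# Assumes `timestamp` is POSIXct and rows are sorted by cookie, then time.
flag_sessions <- function(timestamp, cookie, gap_secs = 30 * 60) {
  n <- length(cookie)
  if (n == 0L) return(integer(0))
  # Seconds since the previous row; units are fixed so the threshold is unambiguous
  dt <- c(Inf, as.numeric(diff(timestamp), units = "secs"))
  # TRUE where the cookie differs from the previous row (first row counts as new)
  new_cookie <- c(TRUE, cookie[-1] != cookie[-n])
  as.integer(new_cookie | dt > gap_secs)
}

# Small sample taken from the question's data
ts <- as.POSIXct(c("2015-09-19 22:26:50", "2015-09-19 22:27:09",
                   "2015-09-20 22:29:14", "2015-09-20 22:30:17",
                   "2015-09-20 23:05:01"), tz = "UTC")
id <- c(223223, 223223, 245458, 245458, 245458)
flag_sessions(ts, id)
# 1 0 1 0 1
```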

In a context where we could do efficient looping (almost any language but R, e.g. Python, or C++ code via Rcpp) we would want to first check for equality of cookies (faster than subtraction), and only if the cookies were equal do the time difference calculation. That would shave off a bit of time.
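To make that check order concrete, here is a sketch of the loop written in plain R with a preallocated result vector. The function name session_loop is hypothetical, and the input is assumed to be sorted by cookie and then by time. In R itself this is likely to be slower than the vectorised version; it is meant only to show the logic one might port to a faster language.

```r
# Illustrative only: the same rule as an explicit loop, cookie check first.
# `session_loop` is a hypothetical name; inputs are assumed sorted by cookie, then time.
session_loop <- function(timestamp, cookie, gap_secs = 1800) {
  n <- length(cookie)
  out <- integer(n)              # preallocate instead of growing a data frame column
  secs <- as.numeric(timestamp)  # POSIXct is stored as seconds since the epoch
  for (i in seq_len(n)) {
    if (i == 1L || cookie[i] != cookie[i - 1L]) {
      out[i] <- 1L               # new cookie: no time arithmetic needed
    } else if (secs[i] - secs[i - 1L] > gap_secs) {
      out[i] <- 1L               # same cookie, but the gap exceeds the threshold
    }
  }
  out
}

ts <- as.POSIXct(c("2015-09-19 22:26:50", "2015-09-19 22:27:09",
                   "2015-09-20 22:29:14", "2015-09-20 22:30:17",
                   "2015-09-20 23:05:01"), tz = "UTC")
id <- c(223223, 223223, 245458, 245458, 245458)
session_loop(ts, id)
# 1 0 1 0 1
```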

Answer 2

A data.table alternative to @BenBolker's answer is:

library(data.table)
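# type = "lag" must be named: the third positional argument of shift() is fill
setDT(df)[, session := +(difftime(Timestamp, shift(Timestamp, 1L, type = "lag"),
                                  units = "secs") > 1800 |
                           CookieID != shift(CookieID, 1L, type = "lag"))
          ][1, session := 1L]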

This gives:

> df
             Timestamp CookieID Name session
1: 2015-08-28 15:46:03   223284    A       1
2: 2015-09-19 22:26:50   223223    A       1
3: 2015-09-19 22:27:09   223223    A       0
4: 2015-09-19 22:28:11   223223    A       0
5: 2015-09-20 22:29:14   245458    B       1
6: 2015-09-20 22:30:17   245458    B       0
7: 2015-09-20 23:05:01   245458    B       1
8: 2015-09-20 23:06:15   245458    B       0
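A possible variant, sketched here under the assumption that the data are sorted by CookieID and then Timestamp, is to group by CookieID so that the cookie comparison becomes implicit: the first row of each group has no predecessor and therefore starts a session. The table dt_demo and the helper column gap are made up for this illustration, and whether this is faster than the ungrouped version on 1.3 million rows has not been measured.

```r
library(data.table)

# Small made-up sample; assumed sorted by CookieID, then Timestamp
dt_demo <- data.table(
  Timestamp = as.POSIXct(c("2015-09-19 22:26:50", "2015-09-19 22:27:09",
                           "2015-09-20 22:29:14", "2015-09-20 22:30:17",
                           "2015-09-20 23:05:01"), tz = "UTC"),
  CookieID = c(223223L, 223223L, 245458L, 245458L, 245458L)
)

# Gap in seconds to the previous row of the same cookie (NA for the first row)
dt_demo[, gap := as.numeric(Timestamp) - shift(as.numeric(Timestamp)), by = CookieID]
# A session starts where there is no previous row or the gap exceeds 30 minutes
dt_demo[, session := as.integer(is.na(gap) | gap > 1800)]
dt_demo$session
# 1 0 1 0 1
```

Unlike the ungrouped version, this treats all rows of a CookieID as one group even if they are not adjacent, so the two only agree when the input is sorted as assumed.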

Data used:

df <- structure(list(Timestamp = structure(c(1440769563, 1442694410, 1442694429, 1442694491, 1442780954, 1442781017, 1442783101, 1442783175), class = c("POSIXct", "POSIXt"), tzone = ""), CookieID = c(223284L, 223223L, 223223L, 223223L, 245458L, 245458L, 245458L, 245458L), Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Timestamp", "CookieID", "Name"), row.names = c(NA, -8L), class = "data.frame")

Answer 1
3, accepted, 2015-10-01 13:04:08

Answer 2
2, 2015-10-01 13:28:09
