I have a collection of users' IP addresses and the times at which they hit a particular website. I am trying to get the change in time between each IP address change. To make this easier, I've assigned each row a label indicating whether or not it represents a change from the previous row, and I've done this on a per-user basis.
Sample data:
user.nm ip.addr.txt login.sessn.ts change.label
b c 2/18/2013 16:08 FALSE
b c 2/18/2013 16:08 FALSE
b c 2/28/2013 13:37 FALSE
b c 2/28/2013 16:10 FALSE
b c 2/28/2013 16:20 FALSE
b c 3/5/2013 9:29 FALSE
b c 3/6/2013 11:42 FALSE
b c 3/11/2013 13:55 FALSE <-
b b 6/25/2013 13:22 TRUE <-
b c 6/25/2013 13:22 FALSE <-
b b 8/12/2013 13:18 TRUE <-
b c 8/12/2013 13:18 FALSE
b c 8/20/2013 15:13 FALSE
b c 8/20/2013 15:13 FALSE
b c 9/23/2013 14:08 FALSE
b c 9/23/2013 14:09 FALSE
b c 9/25/2013 11:00 FALSE
b c 10/18/2013 16:54 FALSE
b c 10/18/2013 16:54 FALSE
b c 10/30/2013 14:33 FALSE
b c 11/8/2013 15:03 FALSE
b c 11/18/2013 11:30 FALSE
b c 11/18/2013 11:33 FALSE
b c 11/20/2013 16:08 FALSE
b c 11/21/2013 11:51 FALSE
b c 11/21/2013 11:52 FALSE
b c 11/21/2013 15:18 FALSE
b c 11/21/2013 16:40 FALSE
b c 11/21/2013 16:44 FALSE
b c 11/21/2013 16:45 FALSE
b c 11/21/2013 16:45 FALSE
b c 11/29/2013 15:41 FALSE
b c 11/29/2013 15:41 FALSE
a a 1/9/2013 15:32 FALSE
a a 1/9/2013 15:32 FALSE
a a 1/9/2013 15:32 FALSE
a a 1/9/2013 15:32 FALSE
a a 1/10/2013 10:39 FALSE
a a 1/10/2013 10:39 FALSE
a a 1/10/2013 10:39 FALSE
a a 1/11/2013 10:31 FALSE
a a 1/11/2013 10:31 FALSE
a a 1/18/2013 12:30 FALSE
a a 2/22/2013 10:54 FALSE <-
a b 3/6/2013 12:27 TRUE <-
dput:
sample.data=structure(list(user.nm = c("b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"),
ip.addr.txt = c("c", "c", "c", "c", "c", "c", "c", "c", "b",
"c", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "b"
), login.sessn.ts = structure(c(1361221680, 1361221680, 1362076620,
1362085800, 1362086400, 1362493740, 1362588120, 1363024500,
1372180920, 1372180920, 1376327880, 1376327880, 1377025980,
1377025980, 1379959680, 1379959740, 1380121200, 1382129640,
1382129640, 1383157980, 1383940980, 1384792200, 1384792380,
1384981680, 1385052660, 1385052720, 1385065080, 1385070000,
1385070240, 1385070300, 1385070300, 1385757660, 1385757660,
1357763520, 1357763520, 1357763520, 1357763520, 1357832340,
1357832340, 1357832340, 1357918260, 1357918260, 1358530200,
1361548440, 1362590820), class = c("POSIXct", "POSIXt"), tzone = ""),
change.label = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE)), .Names = c("user.nm", "ip.addr.txt",
"login.sessn.ts", "change.label"), row.names = c(NA, -45L), class = "data.frame")
I am attempting to write a ddply summarize statement to give me the difference in time between each IP change per user (among other things). Normally, I would just subset the data frame to the observations labelled TRUE and pass that to ddply. However, I need the difference between the pairs of rows where a FALSE is immediately followed by a TRUE.
Ideally, the output data frame would look like this:
user.nm change.count min.change.time max.change.time
a 2 10 sec 4 hours
b 1 1 hour 1 hour
I was hoping to use some kind of index lookup function like match, but I'm not sure how to translate this into a function.
Is there some kind of "look-behind" function in R that could help with this?
My code for getting the number of IP changes works well so far, and is below:
did.change<-function(vec){
#consumes vector
#returns a p-1 boolean vector of instances where element is not directly repeated (duplicated)
b.vec=head(vec, -1)==tail(vec, -1)
return(!b.vec)
}
###this function works on the ENTIRE list of entries per user, which is too broad
time.changes<-function(vec){
#consumes vector
#returns a p-1 vector of absolute differences between consecutive elements
a=head(vec, -1)-tail(vec, -1)
return(abs(a))
}
user.changes=ddply(sample.data, c("user.nm"), summarize,
change.count=sum(did.change(ip.addr.txt)))
#max.change.time=max(time.changes(login.sessn.ts)),
#min.change.time=min(time.changes(login.sessn.ts)))
Short answer: yes, and it is called diff!
Long answer:
is_diff <- which(diff(sample.data$change.label)==1)
ss <- do.call(c,lapply(is_diff,function(x) c(x,x+1)))
sample.data[ss,]
user.nm ip.addr.txt login.sessn.ts change.label
8 b c 2013-03-11 10:55:00 FALSE
9 b b 2013-06-25 10:22:00 TRUE
10 b c 2013-06-25 10:22:00 FALSE
11 b b 2013-08-12 10:18:00 TRUE
44 a a 2013-02-22 07:54:00 FALSE
45 a b 2013-03-06 09:27:00 TRUE
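To see why this works, here is a minimal sketch on a toy vector (the values in `lbl` are made up for illustration and are not from the question's data). `diff` coerces FALSE/TRUE to 0/1, so a FALSE followed by a TRUE shows up as +1, and `which` returns the position of the FALSE row in each pair.

```r
# Toy logical vector standing in for change.label (hypothetical values)
lbl <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

# diff() treats FALSE/TRUE as 0/1: element i of d is lbl[i + 1] - lbl[i]
d <- diff(lbl)

# +1 marks a FALSE -> TRUE step; this is the index of the FALSE row
first_of_pair <- which(d == 1)

# Interleave each index with its successor to get both rows of every pair
pairs <- as.vector(rbind(first_of_pair, first_of_pair + 1))
print(pairs)
```

Note that a TRUE followed by another TRUE gives 0, so only the first change after a run of FALSE rows is picked up.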
Here is one way to calculate the changes in login times:
ss_list <- lapply(is_diff,function(x) c(x,x+1))
logins <- lapply(ss_list,function(x) sample.data[x,"login.sessn.ts"])
library(lubridate)
lapply(logins,function(x) diff(ymd_hms(x)))
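Since login.sessn.ts in the dput is already POSIXct, base R can probably do the same without parsing. The sketch below uses `difftime` with an explicit `units` argument so the result does not depend on automatic unit selection. The vectors `ts` and `lbl` are small stand-ins I made up (with an assumed UTC time zone), not the real sample.data.

```r
# Toy stand-ins for login.sessn.ts and change.label (hypothetical, UTC assumed)
ts <- as.POSIXct(c("2013-03-11 13:55:00", "2013-06-25 13:22:00",
                   "2013-06-25 13:22:00", "2013-08-12 13:18:00"), tz = "UTC")
lbl <- c(FALSE, TRUE, FALSE, TRUE)

# Index of the FALSE row in every FALSE -> TRUE pair
idx <- which(diff(lbl) == 1)

# Time from the FALSE row to the TRUE row that follows it, in hours
gap <- difftime(ts[idx + 1], ts[idx], units = "hours")
print(gap)
```

Choosing the unit yourself avoids ending up with a mix of seconds and days when the gaps vary a lot in size.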
If you want to break that down by user.nm, try using dplyr:
library(dplyr)
sample.data %>%
mutate(rownum = 1:nrow(sample.data)) %>%
filter(rownum %in% ss) %>%
group_by(user.nm) %>%
mutate(change = login.sessn.ts - lag(login.sessn.ts))
user.nm ip.addr.txt login.sessn.ts change.label rownum change
1 b c 2013-03-11 10:55:00 FALSE 8 NA days
2 b b 2013-06-25 10:22:00 TRUE 9 9.156420e+06 days
3 b c 2013-06-25 10:22:00 FALSE 10 0.000000e+00 days
4 b b 2013-08-12 10:18:00 TRUE 11 4.146960e+06 days
5 a a 2013-02-22 07:54:00 FALSE 44 NA days
6 a b 2013-03-06 09:27:00 TRUE 45 1.206458e+01 days
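The `change` column above appears to mix units: the values for user b look like seconds and the one for user a looks like days, even though every row is labelled `days`. As a rough sketch of how to get to the per-user summary the question asks for, the code below works in plain numeric seconds and applies `lag` inside each group, so a pair never spans two users. The data frame `toy` and the column names `prev.label` and `gap.secs` are made up for illustration.

```r
library(dplyr)

# Toy data in the shape of sample.data (hypothetical values, UTC assumed)
toy <- data.frame(
  user.nm = c("a", "a", "a", "b", "b", "b", "b"),
  login.sessn.ts = as.POSIXct(c(
    "2013-01-01 10:00:00", "2013-01-01 11:00:00", "2013-01-01 13:00:00",
    "2013-02-01 09:00:00", "2013-02-01 09:30:00", "2013-02-01 10:00:00",
    "2013-02-01 12:00:00"), tz = "UTC"),
  change.label = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

user.changes <- toy %>%
  group_by(user.nm) %>%
  mutate(
    secs = as.numeric(login.sessn.ts),   # seconds since the epoch
    prev.label = lag(change.label),      # NA on each user's first row
    gap.secs = secs - lag(secs)          # time since the previous row
  ) %>%
  # keep only TRUE rows whose previous row (same user) was FALSE
  filter(change.label, !prev.label) %>%
  summarise(
    change.count = n(),
    min.change.time = min(gap.secs),
    max.change.time = max(gap.secs)
  )

print(user.changes)
```

On the real data you would swap `toy` for sample.data; users with no FALSE to TRUE pair simply drop out of the result.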