I have a dataframe with following layout:
id |diff
----
1 | 0
1 | 3
1 | 45
1 | 9
1 | 40
1 | 34
1 | 43
1 | 7
2 | 0
2 | 5
3 | 0
3 | 45
3 | 40
I need to add a counter that starts at 1 for each id and increments by 1 every time diff exceeds 10.
The output I am looking for is :
id |diff | counter
-------------
1 | 0 | 1
1 | 3 | 1
1 | 45 | 2
1 | 9 | 2
1 | 40 | 3
1 | 34 | 4
1 | 43 | 5
1 | 7 | 5
2 | 0 | 1
2 | 5 | 1
3 | 0 | 1
3 | 45 | 2
3 | 40 | 3
The for loop solution is:
raw_data$counter[1] <- 1
for (i in 2:nrow(raw_data)) {
  raw_data$counter[i] <- ifelse(raw_data$id[i] == raw_data$id[i - 1],
                                ifelse(raw_data$diff[i] > 10,
                                       raw_data$counter[i - 1] + 1,
                                       raw_data$counter[i - 1]),
                                1)
}
I am aware of how slow the for loop is on large data. Looking for a faster way.
Here's how to do that with dplyr:
df1 <- read.table(text="id diff
1 0
1 3
1 45
1 9
1 40
1 34
1 43
1 7
2 0
2 5
3 0
3 45
3 40",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(counter = cumsum(diff > 10) + 1)
id diff counter
<int> <int> <dbl>
1 1 0 1
2 1 3 1
3 1 45 2
4 1 9 2
5 1 40 3
6 1 34 4
7 1 43 5
8 1 7 5
9 2 0 1
10 2 5 1
11 3 0 1
12 3 45 2
13 3 40 3
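For completeness, the same cumsum(diff > 10) + 1 logic can also be written in base R with ave(), which applies a function within each id group without any package dependency (a sketch, not part of the original answer):

```r
# Base R equivalent of the dplyr answer: ave() applies the
# cumulative-sum trick within each id group.
df1 <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  diff = c(0, 3, 45, 9, 40, 34, 43, 7, 0, 5, 0, 45, 40)
)
df1$counter <- ave(df1$diff, df1$id, FUN = function(x) cumsum(x > 10) + 1)
df1$counter
# [1] 1 1 2 2 3 4 5 5 1 1 1 2 3
```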
As the OP is looking for a faster way, here is a benchmark comparison of P Lapointe's dplyr solution and a data.table version.
The data.table version is a re-write of P Lapointe's approach in data.table syntax:
library(data.table) # CRAN version 1.10.4 used
DT <- fread(
"id |diff
1 | 0
1 | 3
1 | 45
1 | 9
1 | 40
1 | 34
1 | 43
1 | 7
2 | 0
2 | 5
3 | 0
3 | 45
3 | 40"
, sep = "|")
DT[, counter := cumsum(diff > 10L) + 1L, id]
DT
# id diff counter
# 1: 1 0 1
# 2: 1 3 1
# 3: 1 45 2
# 4: 1 9 2
# 5: 1 40 3
# 6: 1 34 4
# 7: 1 43 5
# 8: 1 7 5
# 9: 2 0 1
#10: 2 5 1
#11: 3 0 1
#12: 3 45 2
#13: 3 40 3
For benchmarking, a larger data set of 130'000 rows is created:
# copy original data set 10000 times
DTlarge <- rbindlist(lapply(seq_len(10000L), function(x) DT))
# make id column unique again
DTlarge[, id := rleid(id)]
dim(DTlarge)
#[1] 130000 2
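The rleid() call works because it assigns a new group number every time the value changes, so the 10000 stacked copies of each original id become distinct ids. A tiny illustration:

```r
library(data.table)

# rleid() numbers consecutive runs of identical values, so
# repeated blocks of the same id become separate groups.
rleid(c(1, 1, 2, 2, 1, 3, 3))
# [1] 1 1 2 2 3 4 4
```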
Timing is done with the microbenchmark package:
df1 <- as.data.frame(DTlarge)
dt1 <- copy(DTlarge)
library(dplyr)
microbenchmark::microbenchmark(
dplyr = {
    df1 %>%
      group_by(id) %>%
      mutate(counter = cumsum(diff > 10) + 1)
},
dt = {
dt1[, counter := cumsum(diff > 10L) + 1L, id]
},
times = 10L
)
The results show that the data.table version is about 20 times faster for this problem size:
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 500.51729 505.50173 512.25642 509.64096 517.31095 535.2736 10
dt 23.06037 23.99073 25.30913 24.71059 25.98322 30.7868 10
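Before trusting the timings, it is worth confirming that both approaches produce the same counter column. A small self-contained sanity check (a sketch, using a tiny hypothetical data set rather than the benchmark data):

```r
library(data.table)
library(dplyr)

# Small data set to verify both methods agree.
dt1 <- data.table(id   = c(1, 1, 1, 2, 2),
                  diff = c(0, 45, 9, 0, 45))
df1 <- as.data.frame(dt1)

# dplyr version
res_dplyr <- df1 %>%
  group_by(id) %>%
  mutate(counter = cumsum(diff > 10) + 1)

# data.table version (modifies dt1 by reference)
dt1[, counter := cumsum(diff > 10L) + 1L, id]

all(res_dplyr$counter == dt1$counter)
# [1] TRUE
```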