
Row-wise operations in R

I have a data frame with the following layout:

id | diff
---------
1 | 0  
1 | 3  
1 | 45  
1 | 9  
1 | 40  
1 | 34  
1 | 43  
1 | 7  
2 | 0  
2 | 5  
3 | 0  
3 | 45  
3 | 40

I need to add a counter in such a way that:

  1. when the id changes, the counter resets to 1.
  2. when the id is the same and diff is less than 10, the counter keeps the preceding counter value.
  3. when the id is the same and diff is greater than 10, the counter is incremented by 1.

The output I am looking for is:

id | diff | counter
-------------------
1 | 0 | 1  
1 | 3 | 1  
1 | 45 | 2  
1 | 9 |  2  
1 | 40 | 3  
1 | 34 | 4  
1 | 43 | 5  
1 | 7  | 5  
2 | 0  | 1  
2 | 5  | 1  
3 | 0  | 1  
3 | 45 | 2  
3 | 40 | 3

The for-loop solution is:

raw_data$counter <- 1  # initialise; the first row of each data set always starts at 1
for (i in 2:nrow(raw_data)) {
  raw_data$counter[i] <- ifelse(raw_data$id[i] == raw_data$id[i - 1],
    ifelse(raw_data$diff[i] > 10, raw_data$counter[i - 1] + 1, raw_data$counter[i - 1]),
    1)
}

I am aware of the increase in run time caused by the for loop and am looking for a faster way.

Here's how to do that with dplyr:

df1 <- read.table(text="id diff  
                  1  0  
                  1  3  
                  1  45  
                  1  9  
                  1  40  
                  1  34  
                  1  43  
                  1  7  
                  2  0  
                  2  5  
                  3  0  
                  3  45  
                  3  40",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(counter = cumsum(diff > 10) + 1)

      id  diff counter
   <int> <int>   <dbl>
1      1     0       1
2      1     3       1
3      1    45       2
4      1     9       2
5      1    40       3
6      1    34       4
7      1    43       5
8      1     7       5
9      2     0       1
10     2     5       1
11     3     0       1
12     3    45       2
13     3    40       3
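
The reason this works: within each id group, the logical vector diff > 10 is TRUE exactly on the rows where the counter should step up, cumsum() accumulates those steps, and the + 1 starts every group at 1. A minimal illustration using the diff values of id == 1 from the question:

# diff > 10 for id == 1 is FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
cumsum(c(0, 3, 45, 9, 40, 34, 43, 7) > 10) + 1
# [1] 1 1 2 2 3 4 5 5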

As the OP is looking for a faster way, here is a benchmark comparison of P Lapointe's dplyr solution and a data.table version.

The data.table version is a re-write of P Lapointe's approach in data.table syntax:

library(data.table)   # CRAN version 1.10.4 used
DT <- fread(
"id |diff  
1 | 0  
1 | 3  
1 | 45  
1 | 9  
1 | 40  
1 | 34  
1 | 43  
1 | 7  
2 | 0  
2 | 5  
3 | 0  
3 | 45  
3 | 40"
, sep = "|")

DT[, counter := cumsum(diff > 10L) + 1L, id]
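
Here, := adds the counter column to DT by reference, and the trailing id is the grouping argument; spelled out with an explicit by =, the equivalent call is:

DT[, counter := cumsum(diff > 10L) + 1L, by = id]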

DT
#    id diff counter
# 1:  1    0       1
# 2:  1    3       1
# 3:  1   45       2
# 4:  1    9       2
# 5:  1   40       3
# 6:  1   34       4
# 7:  1   43       5
# 8:  1    7       5
# 9:  2    0       1
#10:  2    5       1
#11:  3    0       1
#12:  3   45       2
#13:  3   40       3

Benchmark

For benchmarking, a larger data set of 130,000 rows is created:

# copy original data set 10000 times
DTlarge <- rbindlist(lapply(seq_len(10000L), function(x) DT))
# make id column unique again
DTlarge[, id := rleid(id)]
dim(DTlarge)
#[1] 130000      2
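
rleid() assigns a new group id at every change between consecutive values, so after stacking 10,000 copies the data set contains 30,000 distinct id groups rather than just 3. A small illustration of rleid() on made-up values:

data.table::rleid(c(1, 1, 2, 2, 1))
# [1] 1 1 2 2 3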

Timing is done with the microbenchmark package:

df1 <- as.data.frame(DTlarge)
dt1 <- copy(DTlarge)
library(dplyr)
microbenchmark::microbenchmark(
  dplyr = {
    df1 %>%
      group_by(id) %>%
      mutate(counter = cumsum(diff > 10) + 1)
  },
  dt = {
    dt1[, counter := cumsum(diff > 10L) + 1L, id]
  },
  times = 10L
)

The results show that the data.table version is about 20 times faster for this problem size:

Unit: milliseconds
  expr       min        lq      mean    median        uq      max neval
 dplyr 500.51729 505.50173 512.25642 509.64096 517.31095 535.2736    10
    dt  23.06037  23.99073  25.30913  24.71059  25.98322  30.7868    10
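
As a quick sanity check (a minimal sketch reusing the df1 and dt1 objects created above), the two results can be compared to confirm both approaches yield the same counter:

res_dplyr <- df1 %>%
  group_by(id) %>%
  mutate(counter = cumsum(diff > 10) + 1)
# dplyr produces a double column, data.table an integer one, hence the coercion
all.equal(as.integer(res_dplyr$counter), dt1$counter)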
