Recode values within group

I have a data table with many individuals (id) who have each been asked a question (class) n times. Sometimes their answer is 0 or 99 (non-answer codes for "refused to answer" and "unknown", respectively), but when asked again later they do answer the question.

How can I replace the 0 or 99 within an id with the valid answer given by that id?

dummy data:

library(data.table)
df <- data.table(
  id=rep(1:10,each=4), 
  class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

What I would like to get

res <- data.table(
  id=rep(1:10,each=4), 
  class=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,3,
    2,2,2,2,99,99,99,99,1,1,1,1,0,0,0,0))

To visualize the example...

> cbind(df, res = res[, !"id"])

    id class res.class
 1:  1     1         1
 2:  1     1         1
 3:  1     1         1
 4:  1     1         1
 5:  2     1         1
 6:  2     1         1
 7:  2     1         1
 8:  2    99         1
 9:  3     0         1
10:  3     0         1
11:  3     0         1
12:  3     1         1
13:  4     0         2
14:  4     2         2
15:  4     2         2
16:  4     2         2
17:  5    99         1
18:  5    99         1
19:  5    99         1
20:  5     1         1
21:  6     3         3
22:  6     3         3
23:  6     3         3
24:  6     0         3
25:  7     2         2
26:  7     2         2
27:  7     0         2
28:  7    99         2
29:  8    99        99
30:  8    99        99
31:  8    99        99
32:  8    99        99
33:  9     1         1
34:  9     1         1
35:  9     1         1
36:  9     1         1
37: 10     0         0
38: 10     0         0
39: 10     0         0
40: 10     0         0
    id class res.class

In practice I have ~100,000 individuals, which is why I've tagged this data.table, though I am open to other (faster) suggestions.

With data.table, this can also be solved with an update join: df is joined on id to a lookup table that holds one valid value per id, and all class values in df are replaced by the corresponding value from the lookup table.

The lookup table is created by

unique(df[!class %in% c(0,99)], by="id")
   id class
1:  1     1
2:  2     1
3:  3     1
4:  4     2
5:  5     1
6:  6     3
7:  7     2
8:  9     1

The lookup table contains entries only for ids with at least one valid answer. In the subsequent update join, ids without any valid answer are left untouched.

df[unique(df[!class %in% c(0,99)], by="id"), on = "id", class := i.class][]
    id class
 1:  1     1
 2:  1     1
 3:  1     1
 4:  1     1
 5:  2     1
 6:  2     1
 7:  2     1
 8:  2     1
 9:  3     1
10:  3     1
11:  3     1
12:  3     1
13:  4     2
14:  4     2
15:  4     2
16:  4     2
17:  5     1
18:  5     1
19:  5     1
20:  5     1
21:  6     3
22:  6     3
23:  6     3
24:  6     3
25:  7     2
26:  7     2
27:  7     2
28:  7     2
29:  8    99
30:  8    99
31:  8    99
32:  8    99
33:  9     1
34:  9     1
35:  9     1
36:  9     1
37: 10     0
38: 10     0
39: 10     0
40: 10     0
    id class
# check result
all.equal(df$class, res$class)
 [1] TRUE 
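
The update join above overwrites every class value of a matched id, including valid answers that differ from the one stored in the lookup table. If only the 0 and 99 entries should change, one option is to make the assignment conditional. This is a minimal sketch, assuming data.table 1.12.4 or later for fifelse(); the table toy and the name lookup are hypothetical and chosen only for illustration:

```r
library(data.table)

# Hypothetical toy data: id 1 gives two different valid answers,
# id 2 never answers, id 3 has a non-answer between valid answers.
toy <- data.table(
  id    = rep(1:3, each = 3),
  class = c(0, 1, 2, 99, 99, 99, 3, 0, 3))

# One valid answer per id (the first one encountered).
lookup <- unique(toy[!class %in% c(0, 99)], by = "id")

# Update join: within matched ids, replace only the non-answer codes
# and leave existing valid answers as they are.
toy[lookup, on = "id",
    class := fifelse(class %in% c(0, 99), i.class, class)]

toy$class
```

As before, an id with no row in lookup (id 2 in this sketch) is not matched by the join and keeps its original codes.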

Here is a simple two-step solution with data.table.

# Step 1: smallest valid value per id (0 and 99 excluded).
# For ids with no valid answer, min() of an empty vector returns Inf (with a warning).
df[, class2 := min(class[class != 0 & class != 99]), by = id]
# Step 2: keep the original value where class2 is Inf, i.e. ids that only have 0 or 99.
df[, class_final := ifelse(is.infinite(class2), class, class2)]

all(res$class == df$class_final) # check against the expected result
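
The two steps can also be folded into a single grouped call that takes the first valid answer rather than the minimum and never calls min() on an empty vector, so no Inf sentinel is needed. This is only a sketch; the helper first_valid and the column name class_first are hypothetical:

```r
library(data.table)

df <- data.table(
  id = rep(1:10, each = 4),
  class = c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

# Hypothetical helper: return the first valid answer repeated for the
# whole group, or the input unchanged if the group has no valid answer.
first_valid <- function(x, invalid = c(0, 99)) {
  ok <- x[!x %in% invalid]
  if (length(ok) > 0L) rep(ok[1L], length(x)) else x
}

df[, class_first := first_valid(class), by = id]
```

Like the two-step version, this overwrites all rows of an id with a single value, so it assumes each id has only one distinct valid answer.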

Rcpp solution:

df <- data.table(id=rep(1:10,each=4), class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

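library(Rcpp)

cppFunction('std::vector<int> remap_class(std::vector<int> id, std::vector<int> df_class) {
  // First pass: record a valid answer per id (the last valid one seen wins).
  std::map<int, int> class_remap;
  for (std::size_t i = 0; i < id.size(); i++) {
    if (df_class[i] != 0 && df_class[i] != 99) {
      class_remap[id[i]] = df_class[i];
    }
  }
  // Second pass: overwrite every row of an id that has a valid answer.
  for (std::size_t i = 0; i < id.size(); i++) {
    if (class_remap.count(id[i]) != 0) {
      df_class[i] = class_remap[id[i]];
    }
  }
  return df_class;
}', includes = "#include <map>")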

df$class <- remap_class(df$id, df$class)

Now check that the answer is the same as the expected result posted in the question:

df2 <- data.table(id=rep(1:10,each=4), class=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,3,2,2,2,2,99,99,99,99,1,1,1,1,0,0,0,0)) 

all(df2$class == df$class)
[1] TRUE

Here's a dplyr + tidyr solution:

library(dplyr) # for mutate, group_by and `%>%`
library(tidyr) # for fill
df %>%
  mutate(class2 = ifelse(class %in% c(0,99),NA,class)) %>% # new column with NAs so that fill() can be used
  group_by(id) %>%
  fill(class2,.direction = "up")   %>% # fill up, then down, within each id
  fill(class2,.direction = "down") %>%
  mutate(class2 = ifelse(is.na(class2),class,class2)) # replace remaining NAs with the initial value
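
Assuming tidyr 1.0.0 or later, the two fill() calls can probably be merged into one with .direction = "updown", and dplyr's coalesce() can restore the original codes for ids that never answered. A sketch on a hypothetical toy table (toy and out are illustrative names):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: id 2 never gives a valid answer.
toy <- tibble(
  id    = rep(1:3, each = 3),
  class = c(0, 1, 2, 99, 99, 99, 3, 0, 3))

out <- toy %>%
  # turn the non-answer codes into NA so that fill() can act on them
  mutate(class2 = if_else(class %in% c(0, 99), NA_real_, class)) %>%
  group_by(id) %>%
  # fill upwards first, then downwards, within each id
  fill(class2, .direction = "updown") %>%
  ungroup() %>%
  # ids with no valid answer are still NA: fall back to the original code
  mutate(class2 = coalesce(class2, class))
```

Unlike the join-based answers, filling only touches the NA positions, so existing valid answers within an id are preserved.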
