Fastest way to map a new data frame column based on two other columns

Question

I have a data frame that looks something like this:

id|value
01| 100
01| 101
01| 300 #edited for case I originally left out
02| 300
03| 100
03| 101
04| 100

and I would like to add a new column that looks at both the id and the values assigned to each id.

For example: if an id has both 100 and 101 among its values, those rows go to category a. Rows with a value of 300 go to category b. If an id has only one of 100 or 101 (not both), it goes to category c.

result:

id|value|category
01| 100 |  a
01| 101 |  a
01| 300 |  b #edited for case I originally left out
02| 300 |  b
03| 100 |  a
03| 101 |  a
04| 100 |  c

I understand I can loop through the rows and assign the category, but is there a faster, vectorized way?

Answer 1

Here are a couple of options with data.table.

For each 'id', we could count whether '100' is among its unique values and whether '101' is, and add the two counts together. The result is 0, 1, or 2, meaning neither, exactly one, or both are present. This can be converted to a factor and relabelled so that 2 becomes 'a', 0 becomes 'b', and 1 becomes 'c'.

library(data.table)
setDT(df2)[, indx:=sum(unique(value)==100)+sum(unique(value)==101), 
  id][, category:=factor(indx, levels=c(2,0,1), labels=letters[1:3]) ][,
   indx:=NULL][]
#    id value category
#1:  1   100        a
#2:  1   101        a
#3:  2   300        b
#4:  3   100        a
#5:  3   101        a
#6:  4   100        c
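To make the intermediate 'indx' step easier to see, here is a minimal base R sketch of the same counting idea. It uses ave() rather than data.table, which is a swapped-in technique and not part of the answer above, and the names 'df' and 'count_100_101' are only illustrative.

```r
# Same data as df2 in the "data" section
df <- data.frame(id = c(1L, 1L, 2L, 3L, 3L, 4L),
                 value = c(100L, 101L, 300L, 100L, 101L, 100L))

# Hypothetical helper: 1 if 100 is present, plus 1 if 101 is present
count_100_101 <- function(v) sum(unique(v) == 100) + sum(unique(v) == 101)

# ave() applies the helper within each id and repeats the result on every row
indx <- ave(df$value, df$id, FUN = count_100_101)
print(indx)

# Relabel: 2 -> "a", 0 -> "b", 1 -> "c"
df$category <- factor(indx, levels = c(2, 0, 1), labels = letters[1:3])
print(df)
```

Listing the levels in the order c(2, 0, 1) is what lets letters[1:3] line up with a, b, c.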

Or we could create a named vector ('v1') and use it as a lookup table: for each 'id', the sorted unique values are collapsed into a single string with toString(...), and that string is used as the name to index 'v1'.

v1 <- c('100, 101' = 'a', '300'='b', '100'= 'c', '101'='c')
setDT(df2)[, category := v1[toString(sort(unique(value)))], by=id][]
#    id value category
#1:  1   100        a
#2:  1   101        a
#3:  2   300        b
#4:  3   100        a
#5:  3   101        a
#6:  4   100        c
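The lookup itself can be tried outside data.table. The sketch below only uses base R; 'vals', 'key', and 'missing_key' are illustrative names, not part of the answer.

```r
v1 <- c('100, 101' = 'a', '300' = 'b', '100' = 'c', '101' = 'c')

# Values for one id, deliberately unsorted and with a repeat
vals <- c(101L, 100L, 101L)

# sort(unique(...)) makes the key independent of row order and duplicates
key <- toString(sort(unique(vals)))
print(key)

# Indexing the named vector by that string returns the category
print(unname(v1[key]))

# A combination that is not a name in v1 gives NA instead of an error
missing_key <- toString(c(100L, 300L))
print(is.na(v1[missing_key]))
```

Because an unlisted combination yields NA, any id whose set of values was not anticipated in 'v1' shows up as a missing category rather than being silently misclassified.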

Update

Based on the new dataset and the new condition, we can modify the first solution by overwriting the category with 'b' for the rows where value is 300:

 setDT(df3)[, indx:= sum(unique(value)==100) + sum(unique(value)==101), id][, 
 category:= factor(indx, levels=c(2,0,1), labels=letters[1:3])][
 value==300, category:='b'][, indx:=NULL][]
 #    id value category
 #1:  1   100        a
 #2:  1   101        a
 #3:  1   300        b
 #4:  2   300        b
 #5:  3   100        a
 #6:  3   101        a
 #7:  4   100        c

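Or, using the second option, add the new combination '100, 101, 300' to the named vector and again set the rows where value is 300 to 'b':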

  v1 <- c('100, 101' = 'a', '100, 101, 300' = 'a', '300'='b',
            '100'= 'c', '101'='c')
  setDT(df3)[, category := v1[toString(sort(unique(value)))], 
                by=id][value==300, category := 'b'][]
  #   id value category
  #1:  1   100        a
  #2:  1   101        a
  #3:  1   300        b
  #4:  2   300        b
  #5:  3   100        a
  #6:  3   101        a
  #7:  4   100        c
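To see why the extra name '100, 101, 300' is needed, it may help to print the key each id produces for the new data. This is a base R sketch using tapply(), not the answer's data.table code, and 'df' and 'keys' are illustrative names.

```r
# Same data as df3 in the "data" section
df <- data.frame(id = c(1L, 1L, 1L, 2L, 3L, 3L, 4L),
                 value = c(100L, 101L, 300L, 300L, 100L, 101L, 100L))

# One key per id: the sorted unique values collapsed into a string
keys <- tapply(df$value, df$id, function(v) toString(sort(unique(v))))
print(keys)
```

Here id 1 produces a key that includes 300, which the original 'v1' did not list, so without the added name all of its rows would come back as NA.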

data

df2 <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), value = c(100L, 
101L, 300L, 100L, 101L, 100L)), .Names = c("id", "value"), 
row.names = c(NA, -6L), class = "data.frame")

df3 <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L, 4L), 
value = c(100L, 101L, 300L, 300L, 100L, 101L, 100L)),
.Names = c("id", "value"), class = "data.frame",
 row.names = c(NA, -7L))

solution1
2 ACCEPTED 2015-04-01 17:55:51
