Recoding data in R

Question

I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.

# stringsAsFactors = TRUE makes the columns factors (the default before R 4.0.0)
myd <- data.frame(v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
                  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
                  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
                  stringsAsFactors = TRUE)
myd
     v1   v2   v3   v4   v5
1    AB   CC <NA> <NA>   AA
2    AB   CG   TT   TT   AA
3    AA   GG   AT   AT   CA
4  <NA> <NA> <NA>   AT <NA>
5    AA <NA>   AA <NA>   CA
6    BB <NA>   TT   TT   CC
7    AA   GG   AA   AT   CA
8  <NA>   GG <NA>   AT   CA
9    AA <NA>   AT <NA>   CC
10   AA   GG   TT   AA   CC

Each variable has up to four unique values (including NA).

unique(myd$v1)

[1] AB   AA   <NA> BB  
Levels: AA AB BB

unique(myd$v2)

[1] CC   CG   GG   <NA>
Levels: CC CG GG

The unique values can be any combination, but they always consist of two letters (except NA). For example, "A" and "B" in the first case give the combinations "AA", "AB", "BB". The numerical codes for these would be 1, 0, -1 respectively. Similarly, in the second case the letters "C" and "G" give "CC", "CG", "GG", so the numerical codes would again be 1, 0, -1 respectively. Thus the above myd needs to be recoded to:

myd
     v1   v2   v3   v4   v5
1     0    1 <NA> <NA>    1
2     0    0   -1   -1    1
3     1   -1    0    0    0
4  <NA> <NA> <NA>    0 <NA>
5     1 <NA>    1 <NA>    0
6    -1 <NA>   -1   -1   -1
7     1   -1    1    0    0
8  <NA>   -1 <NA>    0    0
9     1 <NA>    0 <NA>   -1
10    1   -1   -1    1   -1

Answer 1

Here is a different solution (skip to the data.table section for the super-fast approach!).

If you want to recode AA, AB, BB to 1, 0, -1 and so on, you can use indexing (along with the factor-to-numeric conversion). This also lets you use a different recoding if you wish.

self made recode function

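# Use the factor's integer level codes as positions into new_codes
# (level 1 -> new_codes[1], level 2 -> new_codes[2], ...; NA stays NA)
simple_recode <- function(.x, new_codes) {
  new_codes[as.numeric(.x)]
}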

as.data.frame(lapply( myd, simple_recode, new_codes = 1:-1))
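To see why the indexing works, it may help to look at a single column. The sketch below uses a small hand-made factor (`v`, a hypothetical example, not the original data) and assumes the default alphabetical level order AA, AB, BB.

```r
# Hypothetical example vector; levels are sorted alphabetically: AA, AB, BB
v <- factor(c("AB", "AA", NA, "BB"))

# Integer level codes underneath the factor (NA stays NA)
codes <- as.numeric(v)

# Use the codes as positions into the vector of new values:
# level 1 (AA) -> 1, level 2 (AB) -> 0, level 3 (BB) -> -1
new_codes <- 1:-1
recoded <- new_codes[codes]
print(recoded)
```

Swapping `new_codes` for any other length-3 vector gives a different recoding without touching the rest of the code.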

use `factor`

You can simply relabel the levels by calling factor() with the new codes as labels. Note that the result is still a factor, not a numeric vector.

as.data.frame(lapply(myd, factor, labels = 1:-1))
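One thing to watch with the relabelling approach is how the result converts to numbers. The sketch below uses a hypothetical column `x` and assumes it contains all three genotypes; a column missing one of them would not match the three labels and would probably need explicit `levels`.

```r
# Hypothetical example column
x <- factor(c("AA", "AB", "BB", NA))

# Relabel the three levels; the result is a factor with levels "1", "0", "-1"
f <- factor(x, labels = 1:-1)

# as.numeric() on a factor returns the level positions, not the labels
positions <- as.numeric(f)

# Going through as.character() recovers the intended numeric codes
values <- as.numeric(as.character(f))

print(positions)
print(values)
```

So if true numeric columns are needed, the `as.numeric(as.character(...))` step is the safe conversion.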

`data.table` for efficiency

If your data is big, I suggest a data.table approach, which is both memory- and time-efficient.

library(data.table)
DT <- as.data.table(myd)
as.data.table(DT[, lapply(.SD, simple_recode, new_codes = 1:-1)])

Or, more efficiently

as.data.table(DT[, lapply(.SD, setattr, 'levels', 1:-1)])

Or, even more efficiently (modifying the levels in place, and avoiding the as.data.table call)

for (name in names(DT)) {
  setattr(DT[[name]], 'levels', 1:-1)
}

setattr modifies by reference, so nothing is copied.
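As a small illustration of the in-place relabelling, here is a minimal sketch on a one-column table. The table and column name are hypothetical, and character labels are used instead of `1:-1` as an assumption of this sketch, so that the result is clearly a well-formed factor.

```r
library(data.table)

# Hypothetical small table with one genotype column
DT <- data.table(v1 = factor(c("AB", "AA", NA, "BB")))

# Relabel the levels in place; no copy of the column is made
setattr(DT[["v1"]], "levels", c("1", "0", "-1"))

# The column is still a factor; convert via as.character() to get numbers
values <- as.integer(as.character(DT[["v1"]]))

print(levels(DT[["v1"]]))
print(values)
```

Only the levels attribute changes, which is why this scales so well: the underlying integer codes of each column are never rewritten.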

Virtually Instantaneous approach using data.table and setattr

As demonstrated on this big dataset:

# some big data (100 columns, 1e6 rows)
big <- replicate(100, factor(sample(c('AA', 'AB', 'BB', NA), 1e6, TRUE)), simplify = FALSE)
bigDT <- as.data.table(big)

# loop over the data.table columns (the plain list `big` has no names)
system.time({
  for (name in names(bigDT)) {
    setattr(bigDT[[name]], 'levels', 1:-1)
  }
})

##    user  system elapsed 
##       0       0       0

Answer 2

You can take advantage of the fact that your data are factors, which have numeric indices underneath them.

For example:

> as.numeric(myd$v1)
 [1]  2  2  1 NA  1  3  1 NA  1  1

The numeric values correspond to the levels() of the factor:

> levels(myd$v1)
[1] "AA" "AB" "BB"

So 1 == AA, 2 == AB, 3 == BB, and so on.

So you can simply convert your data to numeric and apply the necessary arithmetic to scale it the way you want. Here we subtract 2 and then multiply by -1 to get your results:

(sapply(myd, as.numeric) - 2) * -1
#-----
      v1 v2 v3 v4 v5
 [1,]  0  1 NA NA  1
 [2,]  0  0 -1 -1  1
 [3,]  1 -1  0  0  0
 [4,] NA NA NA  0 NA
 [5,]  1 NA  1 NA  0
 [6,] -1 NA -1 -1 -1
 [7,]  1 -1  1  0  0
 [8,] NA -1 NA  0  0
 [9,]  1 NA  0 NA -1
[10,]  1 -1 -1  1 -1
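Note that `sapply` returns a matrix here. If a data frame is preferred, one possible variant is sketched below on a small hypothetical data frame (`geno`, not the original data). It relies on the fact that `(x - 2) * -1` is the same as `2 - x`.

```r
# Hypothetical two-column example with factor columns
geno <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

# Assigning into recoded[] keeps the data frame structure;
# 2 - code maps level codes 1, 2, 3 to 1, 0, -1
recoded <- geno
recoded[] <- lapply(geno, function(col) 2L - as.integer(col))

print(recoded)
```

Like the matrix version, this assumes every column has exactly three levels in alphabetical order.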

Answer 3

If you set up an assignment so the LHS has the proper structure, you can use the implicitly coerced values of the factors as indices into the values you want:

> myd[] <- c(-1,0,1)[data.matrix(myd)]
> myd
   v1 v2 v3 v4 v5
1  NA  0  0  0  1
2  -1  1  0  0 -1
3   0 NA  1  0  0
4  NA -1 -1  0 -1
5  -1  0  1 -1 NA
6   0 NA  0  1 NA
7  NA  0  1 NA -1
8   0  0  0 -1  1
9  -1 NA  1 -1 NA
10  0  1  1 NA NA
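The order of the lookup vector decides the mapping: `c(1, 0, -1)` sends the first level to 1, as in the question's coding, while `c(-1, 0, 1)` reverses the signs. A minimal sketch on a small hypothetical data frame (`geno`, not the original data), using the question's coding:

```r
# Hypothetical two-column example with factor columns
geno <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB")),
  v2 = factor(c("CC", "CG", "GG", NA))
)

# data.matrix() turns the factor columns into their integer level codes
codes <- data.matrix(geno)

# Index the lookup vector with the codes; geno[] keeps the data frame shape
geno[] <- c(1, 0, -1)[codes]

print(geno)
```

The whole table is recoded in one vectorised indexing step, at the cost of building an intermediate matrix of codes.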

recoding data in r

Question

3 answers

solution1
8 2012-09-17 23:53:59

self made recode function

use `factor`

`data.table` for efficiency

Virtually Instantaneous approach using data.table and setattr

solution2
7 ACCEPTED 2012-09-17 16:06:01

solution3
4 2012-09-18 00:21:33
