Recoding data in R

I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.

myd <- data.frame(v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
                  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
                  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
                  # the answers below rely on factor columns (not the default since R 4.0)
                  stringsAsFactors = TRUE)
myd
     v1   v2   v3   v4   v5
1    AB   CC <NA> <NA>   AA
2    AB   CG   TT   TT   AA
3    AA   GG   AT   AT   CA
4  <NA> <NA> <NA>   AT <NA>
5    AA <NA>   AA <NA>   CA
6    BB <NA>   TT   TT   CC
7    AA   GG   AA   AT   CA
8  <NA>   GG <NA>   AT   CA
9    AA <NA>   AT <NA>   CC
10   AA   GG   TT   AA   CC

Each variable has up to four unique values (three two-letter codes plus NA).

unique(myd$v1)

[1] AB   AA   <NA> BB  
Levels: AA AB BB

unique(myd$v2)

[1] CC   CG   GG   <NA>
Levels: CC CG GG

These unique values can be any combination, but each one consists of two letters (except NA). For example, "A" and "B" in the first case give the combinations "AA", "AB", "BB". The numerical codes for these would be 1, 0, -1 respectively. Similarly, in the second case the letters "C" and "G" give "CC", "CG", "GG", so the numerical codes would again be 1, 0, -1 respectively. Thus the above myd needs to be recoded to:

myd
     v1   v2   v3   v4   v5
1     0    1 <NA> <NA>    1
2     0    0   -1   -1    1
3     1   -1    0    0    0
4  <NA> <NA> <NA>    0 <NA>
5     1 <NA>    1 <NA>    0
6    -1 <NA>   -1   -1   -1
7     1   -1    1    0    0
8  <NA>   -1 <NA>    0    0
9     1 <NA>    0 <NA>   -1
10    1   -1   -1    1   -1
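For a single column, the mapping described above can be sketched as a named lookup vector. This is only an illustration of the target coding, not one of the answers; the names `codes`, `v1` and `recoded` are made up for this example, and `v1` copies the first six values of the `v1` column shown above.

```r
# Hypothetical lookup table: names are the two-letter codes, values the targets.
codes <- c(AA = 1, AB = 0, BB = -1)

# First six values of myd$v1 from the question, as plain character strings.
v1 <- c("AB", "AB", "AA", NA, "AA", "BB")

# Index the lookup table by name; an NA input yields an NA output.
recoded <- unname(codes[v1])
recoded
```

A lookup by name does not depend on the order of factor levels, but it needs one lookup table per pair of letters.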

I will post a different solution (skip to data.table for the superfast approach!).

If you want to recode AA, AB, BB to 1, 0, -1 and so on, you can use indexing (along with the factor-to-numeric solution). This lets you use a different recoding if you wish!

Self-made recode function

simple_recode <- function(.x, new_codes){
  new_codes[as.numeric(.x)]
}

as.data.frame(lapply( myd, simple_recode, new_codes = 1:-1)) 
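To see why the indexing works, here is a small sketch on one factor. It assumes the levels are in the order AA, AB, BB; the vector `x` and the alternative coding `c(2, 1, 0)` are made up for illustration.

```r
simple_recode <- function(.x, new_codes) {
  new_codes[as.numeric(.x)]
}

# A factor with levels AA, AB, BB is stored as the integer codes 1, 2, 3.
x <- factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB"))
as.numeric(x)                               # 2 1 NA 3

# The codes are used as positions in new_codes: position 2 holds 0, and so on.
simple_recode(x, new_codes = 1:-1)          # 0 1 NA -1

# A different (hypothetical) coding only needs a different new_codes vector.
simple_recode(x, new_codes = c(2, 1, 0))    # 1 2 NA 0
```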

Use factor

You can simply relabel the letters by calling factor with the new codes as labels. Note that the result is still a factor (with labels "1", "0", "-1"), not a numeric vector.

as.data.frame(lapply(myd, factor, labels = 1:-1))
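If actual numbers are needed after relabelling, one option is to convert through character. A minimal sketch on a single made-up factor `x`, assuming all three levels are present:

```r
x <- factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB"))

relabelled <- factor(x, labels = 1:-1)
is.factor(relabelled)    # TRUE
levels(relabelled)       # "1" "0" "-1"

# Go through character to recover the numbers written in the labels;
# as.numeric() directly on the factor would return the codes 1, 2, 3 instead.
numbers <- as.numeric(as.character(relabelled))
numbers                  # 0 1 NA -1
```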

data.table for efficiency

If your data is big, then I suggest a data.table approach, which will be memory and time efficient.

library(data.table)
DT <- as.data.table(myd)
as.data.table(DT[, lapply(.SD, simple_recode, new_codes = 1:-1)])

Or, more efficiently:

as.data.table(DT[, lapply(.SD, setattr, 'levels', 1:-1)])

Or, even more efficiently (modifying the levels in place, and avoiding the as.data.table call):

for (name in names(DT)) {
  setattr(DT[[name]], 'levels', 1:-1)
}

setattr modifies by reference, so no copying takes place.

Virtually instantaneous approach using data.table and setattr

As demonstrated on this big dataset:

# some big data (100 columns, 1e6 rows)
big <- replicate(100, factor(sample(c('AA', 'AB', 'BB', NA), 1e6, TRUE)), simplify = FALSE)
bigDT <- as.data.table(big)

system.time({
  # loop over the data.table: the plain list `big` has no names, so names(big) is NULL
  for (name in names(bigDT)) {
    setattr(bigDT[[name]], 'levels', 1:-1)
  }
})

##  user  system elapsed 
##    0        0       0

You can take advantage of the fact that your data are factors, which have numeric indices underneath them.

For example:

> as.numeric(myd$v1)
 [1]  2  2  1 NA  1  3  1 NA  1  1

The numeric values correspond to the levels() of the factor:

> levels(myd$v1)
[1] "AA" "AB" "BB"

So 1 == AA, 2 == AB, 3 == BB, and so on.
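One caveat worth checking: this correspondence holds only when all three levels are present in a column. The sketch below assumes a hypothetical column `x` in which "AA" never occurs, and shows that passing the levels explicitly keeps the codes stable.

```r
# Only two of the three codes happen to occur in this (made-up) column.
x <- c("AB", "BB", NA, "AB")

# Levels inferred from the data: AB becomes level 1.
inferred <- as.numeric(factor(x))
inferred    # 1 2 NA 1

# Levels given explicitly: AB stays level 2, BB stays level 3.
explicit <- as.numeric(factor(x, levels = c("AA", "AB", "BB")))
explicit    # 2 3 NA 2
```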

So you can simply convert your data to numeric and apply the necessary arithmetic to scale it the way you want. Here we subtract 2 and then multiply by -1 to get your results:

(sapply(myd, as.numeric) - 2) * -1
#-----
      v1 v2 v3 v4 v5
 [1,]  0  1 NA NA  1
 [2,]  0  0 -1 -1  1
 [3,]  1 -1  0  0  0
 [4,] NA NA NA  0 NA
 [5,]  1 NA  1 NA  0
 [6,] -1 NA -1 -1 -1
 [7,]  1 -1  1  0  0
 [8,] NA -1 NA  0  0
 [9,]  1 NA  0 NA -1
[10,]  1 -1 -1  1 -1
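The arithmetic can be followed step by step on one factor. This sketch uses a made-up factor `x` with levels AA, AB, BB; the intermediate names `codes`, `centred` and `result` are only for illustration.

```r
x <- factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB"))

codes   <- as.numeric(x)   # 2  1 NA 3   (positions in the levels)
centred <- codes - 2       # 0 -1 NA 1   (middle level moved to 0)
result  <- centred * -1    # 0  1 NA -1  (sign flipped so AA is 1)
result
```

The two steps together are the same as computing `2 - codes`.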

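If you set up the assignment so that the LHS has the proper structure (myd[] keeps the data frame), you can use the implicitly coerced values of the factors as indices into the values you want. The output below comes from a different random draw of myd than the one in the question, and c(-1, 0, 1) maps the first level to -1; use c(1, 0, -1) for the coding asked for in the question: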

> myd[] <- c(-1,0,1)[data.matrix(myd)]
> myd
   v1 v2 v3 v4 v5
1  NA  0  0  0  1
2  -1  1  0  0 -1
3   0 NA  1  0  0
4  NA -1 -1  0 -1
5  -1  0  1 -1 NA
6   0 NA  0  1 NA
7  NA  0  1 NA -1
8   0  0  0 -1  1
9  -1 NA  1 -1 NA
10  0  1  1 NA NA
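Because the output above depends on random data, here is the same idea on a small fixed data frame. This is a sketch under assumptions: the data frame `df` is made up, the levels are given explicitly, and the codes are written as c(1, 0, -1) to match the coding requested in the question rather than the c(-1, 0, 1) used in the answer.

```r
df <- data.frame(
  v1 = factor(c("AB", "AA", NA), levels = c("AA", "AB", "BB")),
  v2 = factor(c("CC", NA, "GG"), levels = c("CC", "CG", "GG"))
)

# data.matrix() turns each factor column into its integer codes (1, 2, 3).
m <- data.matrix(df)

# Assigning into df[] keeps the data frame shape and the column names.
df[] <- c(1, 0, -1)[m]
df
```

After the assignment, `df$v1` holds 0, 1, NA and `df$v2` holds 1, NA, -1.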
