Recoding data in R
I have a huge 1000 x 100000 data frame, like the following, that I need to recode to numeric values.
myd <- data.frame(v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
                  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
                  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
                  stringsAsFactors = TRUE  # required in R >= 4.0.0 so the columns are factors
)
myd
v1 v2 v3 v4 v5
1 AB CC <NA> <NA> AA
2 AB CG TT TT AA
3 AA GG AT AT CA
4 <NA> <NA> <NA> AT <NA>
5 AA <NA> AA <NA> CA
6 BB <NA> TT TT CC
7 AA GG AA AT CA
8 <NA> GG <NA> AT CA
9 AA <NA> AT <NA> CC
10 AA GG TT AA CC
Each variable has potentially four unique values.
unique(myd$v1)
[1] AB AA <NA> BB
Levels: AA AB BB
unique(myd$v2)
[1] CC CG GG <NA>
Levels: CC CG GG
Such unique values can be any combination, but each consists of two letters (except NA). For example, "A" and "B" in the first case give the combinations "AA", "AB", "BB". The numeric codes for these would be 1, 0, -1 respectively. Similarly, in the second case the letters "C" and "G" give "CC", "CG", "GG", so the numeric codes would again be 1, 0, -1 respectively. Thus the above myd needs to be recoded to:
myd
v1 v2 v3 v4 v5
1 0 1 <NA> <NA> 1
2 0 0 -1 -1 1
3 1 -1 0 0 0
4 <NA> <NA> <NA> 0 <NA>
5 1 <NA> 1 <NA> 0
6 -1 <NA> -1 -1 -1
7 1 -1 1 0 0
8 <NA> -1 <NA> 0 0
9 1 <NA> 0 <NA> -1
10 1 -1 -1 1 -1
I will post a different solution (skip to data.table for the superfast approach!).
If you want to recode AA, AB, BB to 1, 0, -1 and so on, you can use indexing (along with the factor-to-numeric solution). This lets you use a different recoding if you wish!
simple_recode <- function(.x, new_codes){
new_codes[as.numeric(.x)]
}
as.data.frame(lapply( myd, simple_recode, new_codes = 1:-1))
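To make the indexing idea concrete, here is a minimal sketch on a small hand-built factor. The vector x and the alternative code set c(2, 1, 0) are assumptions made up for illustration; they are not part of the original data.

```r
# Hypothetical example data: a factor with levels AA, AB, BB (in that order)
x <- factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB"))

simple_recode <- function(.x, new_codes) {
  # as.integer() on a factor returns the level index (1, 2, 3, or NA),
  # which is then used to look up the replacement code
  new_codes[as.integer(.x)]
}

# The coding asked for in the question: AA -> 1, AB -> 0, BB -> -1
default_codes <- simple_recode(x, new_codes = c(1, 0, -1))

# A different recoding only needs a different lookup vector
other_codes <- simple_recode(x, new_codes = c(2, 1, 0))
```

Missing values stay missing, because indexing a vector with NA returns NA.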
factor
You can simply relabel the letters by calling factor with the new levels as labels:
as.data.frame(lapply(myd, factor, labels = 1:-1))
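One thing to keep in mind with relabelling is that the result is still a factor, so a further conversion is needed if true numeric values are wanted. The sketch below uses a small hand-built factor f (an assumption for illustration) to show this.

```r
# Hypothetical example data: a factor with levels AA, AB, BB
f <- factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB"))

# Relabel the levels; the labels are stored as the strings "1", "0", "-1"
relabelled <- factor(f, labels = 1:-1)

# Go through character to recover the numeric codes;
# as.numeric(relabelled) alone would return the level indices 2, 1, NA, 3
codes <- as.numeric(as.character(relabelled))
```

Note that this relies on every column actually containing all three genotypes: factor() drops unused levels when it is called on a factor, so a column with only two observed values would not match three labels and the call would fail.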
data.table for efficiency
If your data is big, then I suggest a data.table approach, which will be memory and time efficient.
library(data.table)
DT <- as.data.table(myd)
as.data.table(DT[, lapply(.SD, simple_recode, new_codes = 1:-1)])
Or, more efficiently:
as.data.table(DT[, lapply(.SD, setattr, 'levels', 1:-1)])
Or, even more efficiently (modifying the levels in place, and avoiding the as.data.table call):
for(name in names(DT)){
setattr(DT[[name]],'levels',1:-1)
}
setattr modifies by reference, so nothing is copied.
As demonstrated on this big dataset:
# some big data (100 columns, 1e6 rows)
big <- replicate(100, factor(sample(c('AA','AB','BB', NA), 1e6, TRUE)), simplify = FALSE)
bigDT <- as.data.table(big)
system.time({
# loop over the data.table columns (the plain list `big` has no names)
for(name in names(bigDT)){
setattr(bigDT[[name]],'levels',1:-1)
}
})
## user system elapsed
## 0 0 0
You can take advantage of the fact that your data are factors, which have numeric indices underneath them.
For example:
> as.numeric(myd$v1)
[1] 2 2 1 NA 1 3 1 NA 1 1
The numeric values correspond to the levels() of the factor:
> levels(myd$v1)
[1] "AA" "AB" "BB"
So 1 == AA, 2 == AB, 3 == BB, and so on.
So you can simply convert your data to numeric and apply the necessary arithmetic to scale it the way you want. Here we subtract 2 and then multiply by -1 to get your results:
(sapply(myd, as.numeric) - 2) * -1
#-----
v1 v2 v3 v4 v5
[1,] 0 1 NA NA 1
[2,] 0 0 -1 -1 1
[3,] 1 -1 0 0 0
[4,] NA NA NA 0 NA
[5,] 1 NA 1 NA 0
[6,] -1 NA -1 -1 -1
[7,] 1 -1 1 0 0
[8,] NA -1 NA 0 0
[9,] 1 NA 0 NA -1
[10,] 1 -1 -1 1 -1
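Since sapply returns a matrix, a variant that keeps the data frame structure may be useful. The following is a minimal sketch under the assumption that every column is a factor with exactly three levels in the desired order; the data frame df is made up for illustration.

```r
# Hypothetical example data with explicit factor levels
df <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB")),
  v2 = factor(c("CC", "GG", "CG", NA), levels = c("CC", "CG", "GG"))
)

# 2 - index is the same arithmetic as (index - 2) * -1:
# level 1 -> 1, level 2 -> 0, level 3 -> -1
recoded <- df
recoded[] <- lapply(df, function(col) 2 - as.integer(col))
```

Assigning into recoded[] replaces the columns while keeping the row and column names of the data frame.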
If you set up an assignment so the LHS has the proper structure, you can use the implicitly coerced values of the factors as indices into the values you want:
> myd[] <- c(-1,0,1)[data.matrix(myd)]
> myd
v1 v2 v3 v4 v5
1 NA 0 0 0 1
2 -1 1 0 0 -1
3 0 NA 1 0 0
4 NA -1 -1 0 -1
5 -1 0 1 -1 NA
6 0 NA 0 1 NA
7 NA 0 1 NA -1
8 0 0 0 -1 1
9 -1 NA 1 -1 NA
10 0 1 1 NA NA
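Note that the lookup vector c(-1, 0, 1) maps the first level to -1, which is the mirror image of the coding asked for in the question (AA -> 1). A minimal sketch with the lookup reversed, on a small made-up data frame df (an assumption for illustration), might look like this:

```r
# Hypothetical example data with explicit factor levels
df <- data.frame(
  v1 = factor(c("AB", "AA", NA, "BB"), levels = c("AA", "AB", "BB")),
  v2 = factor(c("CC", "GG", "CG", NA), levels = c("CC", "CG", "GG"))
)

# data.matrix() turns each factor column into its integer level codes
index <- data.matrix(df)

# Lookup in the order requested in the question: level 1 -> 1, 2 -> 0, 3 -> -1
lookup <- c(1, 0, -1)

# Assigning into df[] keeps the data frame shape and fills it column by column
df[] <- lookup[index]
```

After the assignment the columns are plain numeric vectors rather than factors.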