Recode a data frame based on one column

Question

I have a 5845 * 1095 (rows * columns) data frame that looks like this:

 9  286593   C     C/C     C/A     A/A
 9  334337   A     A/A     G/A     A/A
 9  390512   C     C/C     C/C     C/C

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A") 
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

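I want the value in the third column to be used to recode the columns to its right. So if (as in row 1) column 3 is "C", column 4 changes from "C/C" to "0", because both letters match the reference. One matching letter (it can be the first or the second) becomes "1", and no matching letter becomes "2".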

9 286593  C  0  1  2
9 334337  A  0  1  0
9 390512  C  0  0  0 

c <-  c("9", "286593", "C", "0", "1", "2") 
d <-  c("9", "334337", "A", "0", "1", "0")
e <-   c("9", "390512", "C", "0", "0", "0")
dat <- data.frame(rbind(c,d,e))

I am interested in seeing the best way to do this, as I want to get out of the habit of using nested for loops in R.

Answer 1

First, your data:

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A")
# Note: In your original data, you had a space in "G/A", which I removed.
# If that was not a mistake, we would also have to deal with the space.
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

Now we generate a vector containing all possible letters.

values <- c("A", "C", "G", "T")
dat$X3 <- factor(dat$X3, levels=values) # This way we just ensure that it will later on be possible to compare the reference values to our generated data. 

# Generate all possible combinations of two letters
combinations <- expand.grid(f=values, s=values)
combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))

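The main function looks up, for every entry of a column, the matching row of combinations, and then compares the two letters in that row with the reference column 3.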

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

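Finally, we use apply to run the function on all columns that have to be changed. You will probably want to change the 6 to your actual number of columns.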

dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])
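To make the lookup step concrete, here is a minimal sketch that applies the same logic to a single column (column 5 of the sample data). The names col, ref, first_hit, second_hit and score are only illustrative and are not part of the answer above.

```r
# Rebuild the lookup table; expand.grid() returns factor columns by default.
values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

# One genotype column and the reference column, using the same factor levels.
col <- c("C/A", "G/A", "C/C")
ref <- factor(c("C", "A", "C"), levels = values)

m <- match(col, combinations$v)            # row of `combinations` for each entry
first_hit  <- combinations$f[m] == ref     # first letter equals the reference?
second_hit <- combinations$s[m] == ref     # second letter equals the reference?
score <- 2 - first_hit - second_hit
score
# [1] 1 1 0
```

Because f, s and ref share the same levels, the comparisons are made between factors rather than by parsing strings for every cell.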

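Note that this solution, in contrast to the other solutions so far, does not use string comparisons but relies purely on factor levels. It would be interesting to see which one performs better.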

Edit

I just did some benchmarking:

    test replications elapsed relative user.self sys.self user.child sys.child
1   arun      1000000   2.881    1.116     2.864    0.024          0         0
2  fabio      1000000   2.593    1.005     2.558    0.030          0         0
3 roland      1000000   2.727    1.057     2.687    0.048          0         0
5  thilo      1000000   2.581    1.000     2.540    0.036          0         0
4  tyler      1000000   2.663    1.032     2.626    0.042          0         0

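This makes my version slightly faster. However, the difference is almost negligible, so you will probably be fine with any of the approaches. And to be fair: I did not benchmark the part where I add the extra factor levels. Including that would probably rule my version out.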
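The benchmark code itself is not shown here. As a rough sketch of how such a comparison could be set up with base R only, the following times two of the approaches from this thread. The helper time_it and the wrappers by_split and by_gsub are hypothetical names introduced for illustration, and the repetition count is arbitrary.

```r
# Hypothetical timing helper: run f() `reps` times and return elapsed seconds.
time_it <- function(f, reps = 1000) {
  system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

dat <- data.frame(rbind(
  c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
  d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
  e = c("9", "390512", "C", "C/C", "C/C", "C/C")
))

# Two approaches from this thread, wrapped as zero-argument functions.
by_split <- function() {
  t(apply(dat[, 3:6], 1, function(x) {
    2 - sapply(strsplit(x[-1], "/"), function(y) sum(y %in% x[1]))
  }))
}
by_gsub <- function() {
  apply(dat[, -(1:3)], 2, function(x) {
    2 - (dat[[3]] == gsub("..$", "", x)) - (dat[[3]] == gsub("^..", "", x))
  })
}

timings <- c(split = time_it(by_split), gsub = time_it(by_gsub))
timings / min(timings)   # relative times, comparable to the `relative` column
```

Timings on a three-row sample may not carry over to the full 5845 * 1095 data frame, so it is worth repeating the measurement on data of realistic size.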

Answer 2

Here is one approach:

FUN <- function(x) {
    a <- strsplit(as.character(unlist(x[-1])), "/")
    b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1]))))
    2 - b
}

dat[4:6] <-  t(apply(dat[, 3:6], 1, FUN))

## > dat
##   X1     X2 X3 X4 X5 X6
## c  9 286593  C  0  1  2
## d  9 334337  A  0  1  0
## e  9 390512  C  0  0  0
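To see what FUN does for one row, the steps can be run by hand. This is only an illustrative sketch for the first row of the sample data; the names alleles and matches are not from the answer.

```r
x <- c("C", "C/C", "C/A", "A/A")    # reference letter, then the three genotypes
alleles <- strsplit(x[-1], "/")     # list of two-letter character vectors
matches <- sapply(alleles, function(y) sum(y %in% x[1]))
matches        # letters equal to the reference
# [1] 2 1 0
2 - matches    # recoded values
# [1] 0 1 2
```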

Answer 3

Here is one way using apply:

out <- apply(dat[, -(1:2)], 1, function(x) 
        2 - grepl(x[1], x[-1]) -  
        x[-1] %in% paste(x[1], x[1], sep="/"))
cbind(dat[, (1:3)], t(out))
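The arithmetic works because grepl flags genotypes that contain the reference letter at least once, while %in% flags genotypes where both letters are the reference; subtracting both flags from 2 gives the number of non-matching letters. A small sketch for one row, with illustrative names and with fixed = TRUE added as an assumption so the letter is matched literally rather than as a regular expression:

```r
ref <- "C"
genotypes <- c("C/C", "C/A", "A/A")

at_least_one <- grepl(ref, genotypes, fixed = TRUE)    # TRUE TRUE FALSE
both <- genotypes %in% paste(ref, ref, sep = "/")      # TRUE FALSE FALSE
recoded <- 2 - at_least_one - both
recoded
# [1] 0 1 2
```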

Answer 4

This solution is not very efficient:

dat <-  cbind(dat[,-(4:6)],
              t(sapply(seq_len(nrow(dat)),function(i){
                res <- dat[i,]
                res[,4:6] <- lapply(res[,4:6],function(x) 2-sum(gregexpr(res[,3],x)[[1]]>0))
              })))

#  X1     X2 X3 X4 X5 X6
#c  9 286593  C  0  1  2
#d  9 334337  A  0  1  0
#e  9 390512  C  0  0  0

Answer 5

Ugly, but it works!

fff<-apply(dat[,4:ncol(dat)],2,substr,1,1)!=dat[,3]
ggg<-apply(dat[,4:ncol(dat)],2,substr,3,3)!=dat[,3]
final<-fff+ggg
cbind(dat,final)
  X1     X2 X3  X4  X5  X6 X4 X5 X6
c  9 286593  C C/C C/A A/A  0  1  2
d  9 334337  A A/A G/A A/A  0  1  0
e  9 390512  C C/C C/C C/C  0  0  0

Answer 6

Another contribution to R-golf:

cbind(dat[, 1:3],
      apply(dat[, -(1:3)], 2, function(x) {
        2 - (dat[[3]] == gsub('..$', '', x)) - (dat[[3]] == gsub('^..', '', x))
      }))
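The two regular expressions simply strip characters to isolate each letter: '..$' removes the last two characters and '^..' removes the first two. The sketch below shows this on one column; it assumes every genotype is exactly three characters long (letter, slash, letter), and the names g, first and second are illustrative.

```r
g <- c("C/A", "G/A", "C/C")
first  <- gsub("..$", "", g)   # drop the last two characters, keep the first letter
second <- gsub("^..", "", g)   # drop the first two characters, keep the last letter
first
# [1] "C" "G" "C"
second
# [1] "A" "A" "C"
```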
