[英]Pairwise Comparison of Rows in R
我有一個數據集,其中包含許多樣本中許多測試的結果。 樣本在數據集中復制。 我想比較每組重復樣本中重復樣本之間的測試結果。 我認為首先按SampleID拆分數據幀可能是最簡單的方法,這樣我可以獲得一個數據幀列表,每個SampleID都有一個數據幀。 一個樣本可能有2、3、4甚至5個重復,因此每個樣本組要比較的唯一行組合的數量是不同的。 我具有下面所闡述的邏輯。 我想在數據幀列表上運行一個函數並輸出匹配結果。 該函數將比較每組重復樣本中2行的唯一集合,並返回“ Match”,“ Mismatch”或NA值(如果缺少一個或兩個測試值)。 它還將返回兩個比較重復之間重疊的測試計數,匹配數和不匹配數。 最后,它將在其中包含一列樣本名稱及其行號粘貼在一起的列,這樣我就知道比較了兩個樣本(例如Sample1.1_Sample1.2)。 有人能指出我正確的方向嗎?
#Input data structure
data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"),
rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))
colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
data
data.split = split(data, data$SampleID)
##Row comparison function
#Input is a list of data frames. Each data frame contains results for replicates of the same sample.
RowCompare = function(x){
rowcount = nrow(x)
##ifelse(rowcount==2,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
#ifelse(rowcount==3,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 1 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 2 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
return(results)
}
#Output is a list of data frames - one for sample name
out = lapply(names(data.split), function(x) RowCompare(data.split[[x]]))
#Row bind the list of data frames back together to one large data frame
out.merge = do.call(rbind.data.frame, out)
head(out.merge)
#Desired output
out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"),
rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA),
rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"),
rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,8), rbind(2,1,1,1)))
colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
out.merge
我確實在另一篇文章中看到的一件我認為可能有用的事情是,下面的行將創建一個具有唯一行組合的數據框,然后可以使用該行定義在各組復制樣本中要比較的行。 雖然不確定如何實現它。
t(combn(nrow(data),2))
謝謝。
t(combn(nrow(data),2))
使您處在正確的軌道上。 請參閱下面的說明。
testCols <- which(grepl("^Test\\d+",colnames(data)))
TestsCompare=function(x,y){
##how many non-NA values overlap
overlaps <- sum(!is.na(x) & !is.na(y))
##of those that overlap, how many match
matches <- sum(x==y, na.rm=T)
##of those that overlap, how many do not match
non_matches <- overlaps - matches # complement of matches
c(overlaps,matches,non_matches)
}
RowCompare= function(x){
comp <- NULL
pairs <- t(combn(nrow(x),2))
for(i in 1:nrow(pairs)){
row_a <- pairs[i,1]
row_b <- pairs[i,2]
a_tests <- x[row_a,testCols]
b_tests <- x[row_b,testCols]
comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
}
colnames(comp) <- c("row_a","row_b","overlaps","matches","non_matches")
return(comp)
}
out = lapply(data.split, RowCompare)
生產:
> out
$Sample1
row_a row_b overlaps matches non_matches
[1,] 1 2 8 6 2
$Sample2
row_a row_b overlaps matches non_matches
[1,] 1 2 10 9 1
[2,] 1 3 9 8 1
[3,] 2 3 9 9 0
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.