
How to perform dist and kNN in R for genomic data?

I have genomic data with missing values, and I want to calculate the distance between the expression levels of each pair of genes using only the available values. Then I want to find the K nearest neighbors to fill the gaps. How can I do that in R?

gene   sample 1   sample 2   sample 3   sample 4
1      5555       NA         2151       5484
2      5564       NA         NA         NA
3      4544       4656       14546      45455
4      NA         54654      NA         NA

... How can I calculate the Euclidean distance? Do I need to use just one row at a time?

Sorry, I'm new to genomic data and I can't find this information anywhere.

Thanks.

I guess what you are trying to do is kNN imputation of the missing values, not kNN classification. There is a ready-made function for this called impute.knn in the impute package on Bioconductor. Read the help file closely before use.

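# Install impute from Bioconductor (biocLite() has been retired in favor of BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("impute")
library(impute)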

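x <- rnorm(1000, 50, 5)       # 1000 random values (mean 50, sd 5)
x[sample(1:1000, 50)] <- NA   # set 50 randomly chosen values to NA
x <- matrix(x, nrow = 10)     # reshape into a 10 x 100 matrix (rows = genes, columns = samples)
impute.knn(x)                 # returns a list; the imputed matrix is its $data element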
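For intuition about what kNN imputation does, here is a minimal base-R sketch. It is not the algorithm impute.knn actually uses; knn_fill is a hypothetical helper written only for illustration, and the toy matrix m is made up. It assumes each missing value is filled with the plain mean of that column over the k nearest rows, with distances taken from dist(), which skips NAs pairwise.

```r
# Hypothetical helper (not part of any package): fill each NA with the
# mean of the same column over the k nearest rows that have a value there.
knn_fill <- function(m, k = 2) {
  d <- as.matrix(dist(m))   # Euclidean distances between rows; NAs are skipped pairwise
  diag(d) <- NA             # a row must not count as its own neighbor
  out <- m
  for (i in seq_len(nrow(m))) {
    for (j in which(is.na(m[i, ]))) {
      # candidate neighbors: rows with a defined distance to row i and a value in column j
      cand <- which(!is.na(d[i, ]) & !is.na(m[, j]))
      if (length(cand) == 0) next   # nothing to borrow from; leave the NA in place
      nn <- cand[order(d[i, cand])][seq_len(min(k, length(cand)))]
      out[i, j] <- mean(m[nn, j])
    }
  }
  out
}

# Toy data (made up): row 1 is missing its third value
m <- rbind(c(1,   2,   NA),
           c(1,   2,   3),
           c(1.1, 2.1, 5),
           c(10,  20,  30))
filled <- knn_fill(m, k = 2)
print(filled)
```

Rows 2 and 3 are the closest to row 1 here, so the gap is filled from their third-column values. For real expression data, impute.knn remains the better choice.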

Googling for "R k nearest neighbor" leads me to the knn function in the class package. Regarding your second question, the Euclidean distance is simply:

sqrt((sample1_x - sample1_y)^2 + ... + (sample4_x - sample4_y)^2)

where x and y are the indices of the two rows you want to calculate the distance between. However, you have a lot of NAs in your data. I'm not sure how you need to deal with that, as the Euclidean distance is undefined when NAs are involved.
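One possible way to handle the NAs, sketched here as an assumption rather than as the recommended method for expression data, is to use only the samples where both genes have a value. Base R's dist() does this for every pair of rows and then scales the sum of squares up in proportion to the number of columns it had to skip. The helper pair_dist below is hypothetical and shows the unscaled version for a single pair. The matrix reuses the four rows from the question.

```r
# The four genes from the question, one row per gene
expr <- rbind(gene1 = c(5555, NA,    2151,  5484),
              gene2 = c(5564, NA,    NA,    NA),
              gene3 = c(4544, 4656,  14546, 45455),
              gene4 = c(NA,   54654, NA,    NA))
colnames(expr) <- paste0("sample", 1:4)

# Hypothetical helper: Euclidean distance over the samples observed in both rows
pair_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)        # samples where both genes have a value
  if (!any(ok)) return(NA_real_)     # no shared samples: distance is undefined
  sqrt(sum((a[ok] - b[ok])^2))
}

pair_dist(expr["gene1", ], expr["gene2", ])   # only sample1 is shared by these two genes

# All pairwise distances at once; dist() skips NAs pairwise and rescales the sum
d <- dist(expr)
print(d)
```

Pairs with no sample in common, such as gene2 and gene4, come out as NA, so the distance between them stays undefined. Note that pair_dist and dist() differ by the rescaling factor whenever some samples are skipped.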

