How to perform row operations in R to produce a single statistic

I want to compute a mean from a data frame in R. The file holds coverage values (column 4) over ranges (columns 2 and 3) of a chromosome (column 1).

The data looks like this:

V1  V2  V3   V4
 1  65  69  103
 1  69  70  107
 1  70  74  108
 1  74  75  110
 1  75  77  111
 1  77  78  113
 1  78  79  115
 1  79  80  118
 1  80  81  119

I want to compute the mean coverage over the whole file. On paper, this looks like: [103*(69-65) + 107*(70-69) + 108*(74-70) + ... + V4*(V3-V2)] / lengthOfChromosome

The lengthOfChromosome is known.

I've been searching for a solution, and the closest thing I've found is the row-wise operators in the apply() family. These don't seem particularly well suited to the task, since most of them return matrices, lists, or vectors. My goal is to get a single statistic: the mean. I might also be interested in the standard deviation, but that is less important for now.

Any tips in the right direction would be appreciated!

You don't even need apply() here. Most operators in R are vectorized, so arithmetic on whole columns works element by element. So if your data is in a data.frame called dd:

dd<-structure(list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), V2 = c(65L, 
69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L), V3 = c(69L, 70L, 74L, 
75L, 77L, 78L, 79L, 80L, 81L), V4 = c(103L, 107L, 108L, 110L, 
111L, 113L, 115L, 118L, 119L)), .Names = c("V1", "V2", "V3", 
"V4"), class = "data.frame", row.names = c(NA, -9L))

Then you can get the numerator of your equation with a simple expression:

with(dd, sum(V4*(V3-V2)))

(Here we use with() so we don't have to write dd$ repeatedly.) And assuming the length of the chromosome is just the maximum end minus the minimum start, then:

with(dd, sum(V4*(V3-V2))/(max(V3)-min(V2)))
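As a quick check of the arithmetic, here is a minimal sketch that rebuilds the sample data and computes the same quantity step by step. The names width, numerator, span_mean, and wm are chosen here for illustration only. Because the sample intervals are contiguous, the widths sum to max(V3) - min(V2), so base R's weighted.mean() gives the same result; that would not hold if there were gaps between intervals.

```r
# Sample data from the question, rebuilt with data.frame()
dd <- data.frame(
  V1 = rep(1L, 9),
  V2 = c(65L, 69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L),
  V3 = c(69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L, 81L),
  V4 = c(103L, 107L, 108L, 110L, 111L, 113L, 115L, 118L, 119L)
)

# Width of each interval (end minus start), computed for all rows at once
width <- dd$V3 - dd$V2

# Coverage times width, summed over rows: the numerator of the formula
numerator <- sum(dd$V4 * width)

# Mean coverage over the covered span (max end minus min start)
span_mean <- numerator / (max(dd$V3) - min(dd$V2))

# Same value via weighted.mean(), valid here because the intervals are contiguous
wm <- weighted.mean(dd$V4, width)

print(numerator)   # 1748
print(span_mean)   # 109.25
```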

If dat is your data.frame and V1 is always 1:

with(dat, sum(V4*(V3-V2))) / (lengthOfChromosome)
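The question also mentions the standard deviation, which neither answer covers. One possible approach, sketched here under stated assumptions, is a width-weighted standard deviation that treats every base position in a covered interval as one observation. This is the population form (dividing by the total width, not by n - 1), and it covers only the listed intervals rather than the full lengthOfChromosome; both choices are assumptions, not something the original answers specify. The names width, cov_mean, and cov_sd are illustrative.

```r
# Sample data from the question
dat <- data.frame(
  V1 = rep(1L, 9),
  V2 = c(65L, 69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L),
  V3 = c(69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L, 81L),
  V4 = c(103L, 107L, 108L, 110L, 111L, 113L, 115L, 118L, 119L)
)

# Convert to double so large genomic products and sums cannot overflow integer range
width <- as.numeric(dat$V3 - dat$V2)
cov   <- as.numeric(dat$V4)

# Width-weighted mean coverage over the covered bases
cov_mean <- sum(cov * width) / sum(width)

# Width-weighted population standard deviation (assumption: divide by total width)
cov_sd <- sqrt(sum(width * (cov - cov_mean)^2) / sum(width))

# Cross-check: expand to one value per base and compute the same statistic directly
per_base <- rep(cov, times = width)
check_sd <- sqrt(mean((per_base - mean(per_base))^2))
```

If uncovered positions should count as zero coverage, the mean would instead divide by lengthOfChromosome as in the answer above, and the variance would need an extra term for those zero-coverage bases.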
