I want to compute a mean from a data frame in R. The file represents the output of coverage (column 4) over ranges (columns 2,3) of a chromosome (column 1).
The data looks like this:
V1 V2 V3 V4
1 65 69 103
1 69 70 107
1 70 74 108
1 74 75 110
1 75 77 111
1 77 78 113
1 78 79 115
1 79 80 118
1 80 81 119
I want to compute the mean coverage over the whole file. On paper, this looks like: [103*(69-65) + 107*(70-69) + 108*(74-70) + ... + V4*(V3-V2)] / lengthOfChromosome
The lengthOfChromosome is known.
I've been searching for a solution, and the closest thing I've found is the row-wise operators in the apply() family. These don't seem particularly well suited for the task since most of their outputs appear to be either matrices or lists or vectors. My goal is to get a single statistic: the mean. I also might be interested in the standard deviation, but that is less important now.
Any tips in the right direction would be appreciated!
You don't even need apply() here. Most operators in R operate in a vectorized manner. So if your data is in a data.frame called dd:
dd<-structure(list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), V2 = c(65L,
69L, 70L, 74L, 75L, 77L, 78L, 79L, 80L), V3 = c(69L, 70L, 74L,
75L, 77L, 78L, 79L, 80L, 81L), V4 = c(103L, 107L, 108L, 110L,
111L, 113L, 115L, 118L, 119L)), .Names = c("V1", "V2", "V3",
"V4"), class = "data.frame", row.names = c(NA, -9L))
Then you can get the numerator of your equation with a simple
with(dd, sum(V4*(V3-V2)))
(Here we use with() so we don't have to write dd$ a bunch of times.) And assuming the length of the chromosome is just the max end minus the min start, then:
with(dd, sum(V4*(V3-V2))/(max(V3)-min(V2)))
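Since the question also mentions the standard deviation, here is a minimal sketch of one way to get both statistics, weighting each range by its width. It assumes the ranges tile the region with no gaps, so that the summed widths equal the length being averaged over; if lengthOfChromosome is larger than that, the uncovered bases would have to be counted as zero coverage and this sketch does not do so. The names w, mean_cov and sd_cov are illustrative only.

```r
# Same nine rows as the example above, rebuilt so the snippet runs on its own
dd <- data.frame(
  V1 = 1L,
  V2 = c(65, 69, 70, 74, 75, 77, 78, 79, 80),
  V3 = c(69, 70, 74, 75, 77, 78, 79, 80, 81),
  V4 = c(103, 107, 108, 110, 111, 113, 115, 118, 119)
)

w <- dd$V3 - dd$V2                     # number of bases in each range

# Width-weighted mean: identical to sum(V4 * w) / sum(w)
mean_cov <- weighted.mean(dd$V4, w)

# Per-base (population) standard deviation, each range weighted by its width
sd_cov <- sqrt(sum(w * (dd$V4 - mean_cov)^2) / sum(w))

mean_cov  # 109.25
sd_cov
```

For this sample the widths sum to 16, which is the same as max(V3) - min(V2), so mean_cov matches the with() expression above.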
If dat is your data.frame and V1 is only 1:
with(dat, sum(V4*(V3-V2))) / (lengthOfChromosome)
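If V1 holds more than one chromosome, the same numerator can be computed per chromosome and divided by each chromosome's known length. The following is only a sketch: the data frame and the chrom_len vector are made-up values for illustration, not taken from the question.

```r
# Hypothetical input with two chromosomes (illustrative values only)
dat <- data.frame(
  V1 = c(1, 1, 2, 2),
  V2 = c(0, 10, 0, 5),
  V3 = c(10, 20, 5, 20),
  V4 = c(2, 4, 10, 0)
)

# Hypothetical known chromosome lengths, named by chromosome
chrom_len <- c("1" = 40, "2" = 20)

# Sum coverage * width within each chromosome
num <- tapply(dat$V4 * (dat$V3 - dat$V2), dat$V1, sum)

# Divide each sum by the matching chromosome length
mean_by_chrom <- num / chrom_len[names(num)]
mean_by_chrom
```

Indexing chrom_len by names(num) keeps the lengths aligned with the chromosomes even if the two are stored in different orders.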