[英]collapse dataframe based on rows and calculate weighted mean r
我想折疊以下數據框
df
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92
通過僅匹配具有相同Cn和mCn值的連續行,我希望折疊這些行。 例如,對於前4行,我們有以下內容:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
我想折疊具有相同Cn和mCn分數的連續行,因此對於前三行,在Cn和mCn列上分別分別具有“ 2”和“ 1”,並更改End列以反映這一點坍方。
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
但我也想更改Cn_column
,使其成為加權平均值Cn_dependant
,具體lengthMB
該行的lengthMB
得分。 因此,對於前三行,計算將為:
((8.023/9.599) * 1.99) + ((1.145/9.599) * 2) + ((0.431/9.599) * 1.91) = 1.987
前四個獨特染色體的輸出:
Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9598904 1.145 0.030441784 0.000601976 2 1 1.79
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9598904 8.185 0.033499045 -0.031338811 1 0 0.836
chr3 9598904 35402000 0.883 0.049003807 -0.021413391 2 1 2.02
chr4 0 9598904 11.065 0.035092332 -0.022844471 2 1 2
嘗試過這樣的事情,但我也不知道如何包括計算...
squish_segments <- function(sample) {
setDT(sample)[, .ind:= cumsum(c(TRUE,Start[-1]!=End[-.N])),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_)][,
list(Chr=Chromosome[1], Start=Start[1], End=End[.N]),
list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_, .ind)][,.ind:=NULL][]
}
首先,請通過提供數據集的dput
輸出使您的問題更容易重現。
我認為這是您所需要的。
setkey(df, Chromosome, Cn, mCn, Start)
df[, list(
Start=min(Start),
End=max(End),
lengthMB=lengthMB[1],
imba=imba[1],
log2=log2[1],
Cn_=weighted.mean(Cn_, lengthMB)
), keyby=list(Chromosome, Cn , mCn)]
這是dplyr
方法。
library(dplyr)
df = read.table(text=
"Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92", header=T)
df %>%
mutate(Consec = ifelse(Chromosome == dplyr::lag(Chromosome, default = Chromosome[1]) & ## flag consecutive matching chromosomes
Cn == dplyr::lag(Cn, default = Cn[1]) &
mCn == dplyr::lag(mCn, default = mCn[1]), 0, 1),
Consec = cumsum(Consec)) %>% ## create an id for consecutive matching chromosomes
group_by(Chromosome, Cn, mCn, Consec) %>%
summarize(Cn_ = sum(lengthMB * Cn_)/sum(lengthMB),
Start = min(Start),
End = max(End),
lengthMB = first(lengthMB),
imba= first(imba),
log2= first(log2)) %>%
ungroup() %>% ## only if you want to ungroup
select(Chromosome,Start,End, lengthMB,imba,log2,Cn,mCn,Cn_) %>% ## to re arrange column order
arrange(Chromosome, Start)
# Chromosome Start End lengthMB imba log2 Cn mCn Cn_
# (fctr) (int) (int) (dbl) (dbl) (dbl) (int) (int) (dbl)
# 1 chr1 0 9598904 8.023 0.02690512 -0.001671481 2 1 1.9876008
# 2 chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000
# 3 chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000
# 4 chr2 8022930 9598904 1.145 0.03044178 0.000601976 2 1 1.7879569
# 5 chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000
# 6 chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000
# 7 chr3 0 8040000 1.479 NA 0.000969256 2 1 2.0000000
# 8 chr3 8040000 9598904 8.185 0.03349904 -0.031338811 1 0 0.8541823
# 9 chr3 9598904 35402000 0.883 0.04900381 -0.021413391 2 1 2.0187143
# 10 chr4 0 9598904 11.065 0.03509233 -0.022844471 2 1 1.9956599
請注意, lag
是dplyr
函數,也是stats
包函數。 我必須寫dplyr::lag
否則當我嘗試在lag
內指定default =
時會發生沖突。 我不知道您或其他任何人都可以復制此問題。
可以識別唯一的“事件”(具有相同Cn和mCn分數的連續行),然后簡單地遍歷這些事件並相應地修改行。 不是最有效的,但是應該做的。
txt <- "Chromosome Start End lengthMB imba log2 Cn mCn Cn_
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01
chr3 0 8040000 1.479 NA 0.000969256 2 1 2
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92"
df <- read.table(text=txt, header=T)
#identify each unique event
df$eventid <- with(df, cumsum(c(1,diff(as.numeric(factor(Chromosome)))!=0 | diff(Cn)!=0 | diff(mCn)!=0)))
#loop through events
for(i in 1:max(df$eventid)){
#identify rows in df with ith event
rows.i <- which(df$eventid == i)
df[rows.i,] <- within(df[rows.i,],{
#calculate values of interest and assign to first row of event
Start[1] <- min(Start)
End[1] <- max(End)
Cn_[1] <- sum((lengthMB/sum(lengthMB))*Cn_)
lengthMB[1] <- sum(lengthMB)
})
#drop all but first row
if(length(rows.i) > 1) df <- df[-rows.i[-1],]
} #end i
結果
> df
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ eventid
1 chr1 8022945 9598904 1.576 0.03044178 0.000601976 2 1 1.9753871 1
3 chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000 2
4 chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000 3
5 chr2 8022930 9598904 1.576 0.03044178 0.000601976 2 1 1.7879569 4
7 chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000 5
8 chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000 6
9 chr3 0 8040000 1.479 NA 0.000969256 2 1 2.0000000 7
10 chr3 8040000 9598904 12.137 0.03349904 -0.031338811 1 0 0.8541823 8
12 chr3 9598904 35402000 4.978 0.04900381 -0.021413391 2 1 2.0187143 9
14 chr4 0 9598904 52.245 0.03509233 -0.022844471 2 1 1.9956599 10
如果我正確理解了您的問題,則可以使用data.table
快速分組將其data.table
一行。
library(data.table)
dt[, Cn_dependent := sum((lengthMB/sum(lengthMB)) * Cn_),
by = .(Chromosome, Cn, mCn)]
得到這個:
> dt
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 8022945 9168284 1.145 0.03044178 0.000601976 2 1 2.00 1.987601
3: chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 1.987601
4: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
6: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
7: chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 1.882285
8: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
9: chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.01 1.882285
要通過Chromosome
, Cn
和mCn
,可以使用鍵和unique
鍵。
> setkey(dt, "Chromosome", "Cn", "mCn")
> unique(dt)
Chromosome Start End lengthMB imba log2 Cn mCn Cn_ Cn_dependent
1: chr1 0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99 1.987601
2: chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01 3.010000
3: chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87 1.870000
4: chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87 1.882285
5: chr2 0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89 2.890000
這里是一個dput
中的data.table
我開始:
> dput(dt)
structure(list(Chromosome = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("chr1", "chr2"), class = "factor"), Start = c(0L,
8022945L, 9168284L, 9598904L, 0L, 8022930L, 9168284L, 9598904L,
31392788L), End = c(8022945L, 9168284L, 9598904L, 31392788L,
8022930L, 9168284L, 9598904L, 31392788L, 35402000L), lengthMB = c(8.023,
1.145, 0.431, 21.794, 8.023, 1.145, 0.431, 21.794, 1.164), imba = c(0.026905119,
0.030441784, NA, 0.036011994, 0.026905119, 0.030441784, NA, 0.036011994,
0.029733771), log2 = c(-0.001671481, 0.000601976, -0.024952441,
0.002151497, -0.001671481, 0.000601976, -0.024952441, 0.002151497,
0.003149921), Cn = c(2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), mCn = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L), Cn_ = c(1.99, 2, 1.91, 3.01,
2.89, 1.87, 1.57, 1.87, 2.01)), .Names = c("Chromosome", "Start",
"End", "lengthMB", "imba", "log2", "Cn", "mCn", "Cn_"), class = c("data.table",
"data.frame"), row.names = c(NA, -9L), .internal.selfref = <pointer: 0x26abf68>)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.