[英]Subsetting a data frame based on another data frame with multiple conditions
我有一個甲基化數組數據框的列表,如下所示,稱為betatable
:
sample_A sample_B ... chr position
0.5 0.3 chr1 75939
0.3 0.6 chr2 11195
...
我想通過chr和位置范圍的特定條件對上述數據幀進行子集並生成另一個數據幀。 為此,我有另一組數據genes_pos
:
gene chr range_lower range_upper
ABC chr1 34959 69593
...
我當時在考慮使用lapply
但無法弄清楚。 提前謝謝了。
在此示例中,您可以使用dplyr :: inner_join
可重現的示例:
set.seed(123)
x <- data.frame(x = sample(1:100, 100, replace = TRUE), y = sample(1:100, 100, replace = TRUE), chr = sample(c("chr1", "chr2", "chr3"), 100, replace = T), Position = sample(1:10000, 100, replace = TRUE))
genes <- data.frame(gene = c("gene1", "gene2", "gene3"), chr = c("chr1", "chr2", "chr3"), rangelower = c(1, 3000, 6000), rangeupper = c(2999, 5999, 10001))
內部 聯接 ,然后按上限和下限過濾
library(dplyr)
new_df <- x %>%
inner_join(genes, by = "chr") %>%
filter(Position < rangeupper, Position > rangelower)
查看結果:
> head(new_df)
x y chr Position gene rangelower rangeupper
1 90 61 chr1 83 gene1 1 2999
2 96 94 chr2 3896 gene2 3000 5999
3 90 15 chr3 8029 gene3 6000 10001
4 96 41 chr3 8569 gene3 6000 10001
5 100 22 chr3 7040 gene3 6000 10001
6 66 37 chr1 1039 gene1 1 2999
然后我們可以按基因分割數據框。
list_dfs <- split(new_df, new_df$gene)
一種方法是使用非等聯接 。
但是,由於位置是作為因子而不是整數給出的,因此需要准備OP在現在刪除的帖子中提供的樣本數據集
library(data.table)
# prepare data
setDT(betatable, keep.rownames = "sample.id")
setDT(gene_pos)
# coerce positions from factor to integer
betatable[, pos := as.integer(as.character(pos))]
cols <- c("lower", "upper")
gene_pos[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
# non-equi join
betatable[gene_pos, on = .(chr, pos >= lower, pos <= upper), gene := i.gene][!is.na(gene)]
sample.id probe chr pos gene 1: sample_a 111 chr1 335 geneA 2: sample_c 200 chr2 221 geneB 3: sample_e 228 chr2 230 geneC
column <-c("probe","chr","pos")
sample_a <- c("111","chr1","335")
sample_b <- c("115","chr1","380")
sample_c <- c("200","chr2","221")
sample_d <- c("222","chr2","226")
sample_e <- c("228","chr2","230")
betatable <-data.frame(rbind(sample_a,sample_b,sample_c,sample_d,sample_e))
colnames(betatable)<- column
gene_A <- c("geneA","chr1", "120","336")
gene_B <- c("geneB","chr2", "200","222")
gene_C <- c("geneC","chr2", "227","231")
gene_pos <- rbind(gene_A,gene_B,gene_C)
gene_pos <- data.frame(rbind(gene_A,gene_B,gene_C))
colnames(gene_pos)<-c("gene","chr","lower","upper")
betatable
probe chr pos sample_a 111 chr1 335 sample_b 115 chr1 380 sample_c 200 chr2 221 sample_d 222 chr2 226 sample_e 228 chr2 230
gene_pos
gene chr lower upper gene_A geneA chr1 120 336 gene_B geneB chr2 200 222 gene_C geneC chr2 227 231
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.