
How to apply a function to grouped rows between 2 dataframes?

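I have 2 dataframes of genetic data, and I want to run a hypergeometric test (using the GeneOverlap package as the test function) between all the phenotypes in my 2 datasets. I am trying to automate this process and store the results for each phenotype in a new dataframe, but I am stuck on running the function automatically across all the phenotypes in both dataframes.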

My datasets look like this:

Dataset 1:

Gene      Gene_count   Phenotype
Gene1          5       Phenotype1
Gene1          5       Phenotype2
Gene2          3       Phenotype1
Gene3         16       Phenotype6
Gene3.        16       Phenotype2
Gene3         16       Phenotype1

Dataset 2:

Gene    Gene_count     Phenotype
Gene1         10       Phenotype1
Gene1         10       Phenotype2
Gene4         4        Phenotype1
Gene2         17       Phenotype6
Gene6         3        Phenotype2
Gene7         2        Phenotype1

At the moment I run the hypergeometric test one phenotype at a time, like this:

dataset1_pheno1 <- dataset1  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

dataset2_pheno1 <- dataset2  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

go.obj <- newGeneOverlap(dataset1_pheno1$Gene, 
                         dataset2_pheno1$Gene,
                         genome.size=1871)
go.obj <- testGeneOverlap(go.obj)
go.obj 

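I want to repeat this for every phenotype in the 2 datasets. So far I have been trying to use the group_by() function in dplyr and then run the GeneOverlap functions inside it, but I have not been able to get this to work. Which functions can I use to group the rows of both datasets by a column and then run the test one group at a time?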

Example input data:

library(GeneOverlap)
library(dplyr)
library(stringr)

dataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", 
"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))


dataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", 
"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))

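You can split each dataset by Phenotype into a list and then use Map to run the test on each pair of groups. Note, however, that both datasets must contain the same unique phenotypes in the same order, because Map pairs list elements by position, not by name. In other words, all(names(d1_split) == names(d2_split)) must be TRUE.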

d1_split <- split(dataset1, dataset1$Phenotype)
d2_split <- split(dataset2, dataset2$Phenotype)

# this should be TRUE in order for Map to work correctly
all(names(d1_split) == names(d2_split))

tests <- Map(function(d1, d2) {
  go.obj <- newGeneOverlap(d1$Gene, d2$Gene, genome.size = 1871)
  return(testGeneOverlap(go.obj))
}, d1_split, d2_split)
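
If the two datasets do not share exactly the same phenotypes, one possible workaround is to restrict both lists to the phenotype names they have in common before calling Map. The results can then be gathered into one data frame, as the question asks. The sketch below is a minimal illustration under some assumptions, not a definitive implementation. It rebuilds the example data as plain data frames, and the names shared, results, n_overlap, p_value and odds_ratio are chosen here for illustration. It relies on GeneOverlap's accessor functions getIntersection(), getPval() and getOddsRatio().

```r
library(GeneOverlap)

# Example data from the question, rebuilt as plain data frames
# (only the columns needed for the test are kept).
dataset1 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", "Gene3"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)
dataset2 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", "Gene7"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)

# Split the gene vectors by phenotype: one character vector per phenotype.
d1_split <- split(dataset1$Gene, dataset1$Phenotype)
d2_split <- split(dataset2$Gene, dataset2$Phenotype)

# Keep only phenotypes present in both datasets. Indexing both lists with
# the same name vector guarantees that Map pairs matching phenotypes.
shared <- intersect(names(d1_split), names(d2_split))

tests <- Map(function(g1, g2) {
  testGeneOverlap(newGeneOverlap(g1, g2, genome.size = 1871))
}, d1_split[shared], d2_split[shared])

# Collect one row per phenotype using the GeneOverlap accessors.
results <- data.frame(
  Phenotype  = shared,
  n_overlap  = vapply(tests, function(x) length(getIntersection(x)), integer(1)),
  p_value    = vapply(tests, getPval, numeric(1)),
  odds_ratio = vapply(tests, getOddsRatio, numeric(1)),
  row.names  = NULL
)

print(results)
```

Phenotypes that occur in only one dataset are silently dropped by this approach. setdiff(names(d1_split), names(d2_split)) lists the ones that were left out of the first dataset's groups.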

