
How to subset a data frame in R based on another data frame

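I have a data frame containing a large number of RNA-seq counts (sample names as column names, genes as row names), and a metadata file with sex, tissue type, disease status, etc. (sample names as row names; sex and so on as column names). I want a subset of the RNA-seq count data that contains only 2 tissue types, so that I can look at DGE (differential gene expression). Can anyone suggest the best way to do this? I am very new to working with RNA-seq data, so this may be obvious!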

Thanks!

Edit: There are >1000 samples, so picking out columns by typing their names manually may be inaccurate.

Hopefully this gives you an idea of the count data:

dput(head(tpm.df[1:2])) 
structure(list(Description = c("DDX11L1", "WASH7P", "MIR6859-1", 
"MIR1302-2HG", "FAM138A", "OR4G4P"), `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 
187L, 0L, 1L, 0L, 0L)), row.names = c("ENSG00000223972.5", 
"ENSG00000227232.5", 
"ENSG00000278267.1", "ENSG00000243485.5", "ENSG00000237613.2", 
"ENSG00000268020.3"), class = "data.frame")

And this is the metadata:

structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), SMCENTER = c("B1", 
"B1", "B1", "B1, A1", "B1, A1", "B1"), SMPTHNTS = c("", "", "", 
"", "", "2 pieces, ~15% vessel stroma, rep delineated")), row.names = 
c("GTEX-1117F-0003-SM-58Q7G", 
"GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", 
"GTEX-1117F-0011-R10a-SM-AHZ7F", 
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class = 
"data.frame")

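Do you have a "tissue" column in your "metadata" data frame? If so, you can use it to subset the "metadata" data frame, and then use the row names of that subset (the sample names) to subset your TPM values, e.g.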

tpm.df <-
  structure(
    list(
      Description = c(
        "DDX11L1",
        "WASH7P",
        "MIR6859-1",
        "MIR1302-2HG",
        "FAM138A",
        "OR4G4P"
      ),
      `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 187L, 0L, 1L, 0L, 0L)
    ),
    row.names = c(
      "ENSG00000223972.5",
      "ENSG00000227232.5",
      "ENSG00000278267.1",
      "ENSG00000243485.5",
      "ENSG00000237613.2",
      "ENSG00000268020.3"
    ),
    class = "data.frame"
  )

metadata <- structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), 
                           SMCENTER = c("B1", "B1", "B1", "B1, A1", "B1, A1", "B1"), 
                           SMPTHNTS = c("", "", "",  "", "", "2 pieces, ~15% vessel stroma, rep delineated"),
                           TISSUE = c("Adipose", "Skin", "Adipose", "Muscle", "Skin", "Nerve")),
                      row.names = c("GTEX-1117F-0003-SM-58Q7G", "GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", "GTEX-1117F-0011-R10a-SM-AHZ7F", 
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class = 
  "data.frame")

tpm.df
#>                   Description GTEX-1117F-0226-SM-5GZZ7
#> ENSG00000223972.5     DDX11L1                        0
#> ENSG00000227232.5      WASH7P                      187
#> ENSG00000278267.1   MIR6859-1                        0
#> ENSG00000243485.5 MIR1302-2HG                        1
#> ENSG00000237613.2     FAM138A                        0
#> ENSG00000268020.3      OR4G4P                        0
metadata
#>                               SMATSSCR SMCENTER
#> GTEX-1117F-0003-SM-58Q7G            NA       B1
#> GTEX-1117F-0003-SM-5DWSB            NA       B1
#> GTEX-1117F-0003-SM-6WBT7            NA       B1
#> GTEX-1117F-0011-R10a-SM-AHZ7F       NA   B1, A1
#> GTEX-1117F-0011-R10b-SM-CYKQ8       NA   B1, A1
#> GTEX-1117F-0226-SM-5GZZ7             0       B1
#>                                                                   SMPTHNTS
#> GTEX-1117F-0003-SM-58Q7G                                                  
#> GTEX-1117F-0003-SM-5DWSB                                                  
#> GTEX-1117F-0003-SM-6WBT7                                                  
#> GTEX-1117F-0011-R10a-SM-AHZ7F                                             
#> GTEX-1117F-0011-R10b-SM-CYKQ8                                             
#> GTEX-1117F-0226-SM-5GZZ7      2 pieces, ~15% vessel stroma, rep delineated
#>                                TISSUE
#> GTEX-1117F-0003-SM-58Q7G      Adipose
#> GTEX-1117F-0003-SM-5DWSB         Skin
#> GTEX-1117F-0003-SM-6WBT7      Adipose
#> GTEX-1117F-0011-R10a-SM-AHZ7F  Muscle
#> GTEX-1117F-0011-R10b-SM-CYKQ8    Skin
#> GTEX-1117F-0226-SM-5GZZ7        Nerve

# One way to find samples of interest
subset_adipose_samples <- metadata[metadata$TISSUE %in% c("Adipose"),]
subset_adipose_samples
#>                          SMATSSCR SMCENTER SMPTHNTS  TISSUE
#> GTEX-1117F-0003-SM-58Q7G       NA       B1          Adipose
#> GTEX-1117F-0003-SM-6WBT7       NA       B1          Adipose
adipose_samples <- rownames(subset_adipose_samples)
adipose_samples
#> [1] "GTEX-1117F-0003-SM-58Q7G" "GTEX-1117F-0003-SM-6WBT7"

subset_skin_samples <- metadata[metadata$TISSUE %in% c("Skin"),]
subset_skin_samples
#>                               SMATSSCR SMCENTER SMPTHNTS TISSUE
#> GTEX-1117F-0003-SM-5DWSB            NA       B1            Skin
#> GTEX-1117F-0011-R10b-SM-CYKQ8       NA   B1, A1            Skin
skin_samples <- rownames(subset_skin_samples)
skin_samples
#> [1] "GTEX-1117F-0003-SM-5DWSB"      "GTEX-1117F-0011-R10b-SM-CYKQ8"

subset_tpm.df <- tpm.df[c(adipose_samples, skin_samples)]
#> Error in `[.data.frame`(tpm.df, c(adipose_samples, skin_samples)): undefined columns selected

Created on 2022-07-19 by the reprex package (v2.0.1)

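NB. This example returns an error with your example dataset because "tpm.df" has only one sample column, and that sample (GTEX-1117F-0226-SM-5GZZ7) is neither an adipose nor a skin sample. I am relatively sure it will work with your actual data.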
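If some of the selected samples might be missing from the count table, one possible way to avoid the "undefined columns selected" error is to keep only the sample names that are present in both data frames. The following is a minimal sketch, not a tested pipeline. It uses made-up toy data: the sample IDs `S1` to `S4` and the gene names are hypothetical, and `TISSUE` is the illustrative column name used above, so substitute whatever your metadata actually calls it.

```r
# Toy data with hypothetical sample IDs (S1..S4); not real GTEx samples
tpm.df <- data.frame(
  Description = c("GENE_A", "GENE_B"),
  S1 = c(0L, 187L),
  S2 = c(3L, 12L),
  S4 = c(5L, 0L),
  row.names = c("ENSG_A", "ENSG_B")
)
metadata <- data.frame(
  TISSUE = c("Adipose", "Skin", "Adipose", "Nerve"),
  row.names = c("S1", "S2", "S3", "S4")
)

# Sample names for both tissues of interest, selected in one step
wanted <- rownames(metadata)[metadata$TISSUE %in% c("Adipose", "Skin")]

# Keep only the names that are actually columns of the count table
# (S3 is in the metadata but has no counts, so it is dropped here)
present <- intersect(wanted, colnames(tpm.df))

# drop = FALSE keeps the result a data frame even if one column matches
subset_tpm.df <- tpm.df[, present, drop = FALSE]

# Metadata rows in the same order as the count columns
subset_metadata <- metadata[present, , drop = FALSE]
```

Because `subset_metadata` is indexed by the same `present` vector, its rows line up with the columns of `subset_tpm.df`, which is the arrangement DGE tools generally expect. With the example data from the question, `present` would be empty, so this approach would return a data frame with zero columns instead of raising an error.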
