How to subset dataframe in R based on another data
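I have a data frame containing a large number of RNA-seq counts (sample names as column names, genes as row names) and a metadata file with sex, tissue type, disease status, and so on (sample names as row names; sex etc. as column names). I want to subset the RNA-seq count data to just 2 tissue types so that I can look at DGE. Can anyone suggest the best way to do this? I am very new to working with RNA-seq data, so this might be obvious!
Thanks!
Edit: there are >1000 samples, so subsetting the columns by typing out column names by hand may be error-prone.
Hopefully this gives you an idea of the count data: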
dput(head(tpm.df[1:2]))
structure(list(Description = c("DDX11L1", "WASH7P", "MIR6859-1",
"MIR1302-2HG", "FAM138A", "OR4G4P"), `GTEX-1117F-0226-SM-5GZZ7` = c(0L,
187L, 0L, 1L, 0L, 0L)), row.names = c("ENSG00000223972.5",
"ENSG00000227232.5",
"ENSG00000278267.1", "ENSG00000243485.5", "ENSG00000237613.2",
"ENSG00000268020.3"), class = "data.frame")
Here is the metadata:
structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), SMCENTER = c("B1",
"B1", "B1", "B1, A1", "B1, A1", "B1"), SMPTHNTS = c("", "", "",
"", "", "2 pieces, ~15% vessel stroma, rep delineated")), row.names =
c("GTEX-1117F-0003-SM-58Q7G",
"GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7",
"GTEX-1117F-0011-R10a-SM-AHZ7F",
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class =
"data.frame")
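Is there a "tissue" column in your "metadata" data frame? If so, you can use it to subset the "metadata" data frame, and then use the resulting sample names to subset your TPM values, e.g.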
tpm.df <-
  structure(
    list(
      Description = c(
        "DDX11L1",
        "WASH7P",
        "MIR6859-1",
        "MIR1302-2HG",
        "FAM138A",
        "OR4G4P"
      ),
      `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 187L, 0L, 1L, 0L, 0L)
    ),
    row.names = c(
      "ENSG00000223972.5",
      "ENSG00000227232.5",
      "ENSG00000278267.1",
      "ENSG00000243485.5",
      "ENSG00000237613.2",
      "ENSG00000268020.3"
    ),
    class = "data.frame"
  )
metadata <- structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L),
SMCENTER = c("B1", "B1", "B1", "B1, A1", "B1, A1", "B1"),
SMPTHNTS = c("", "", "", "", "", "2 pieces, ~15% vessel stroma, rep delineated"),
TISSUE = c("Adipose", "Skin", "Adipose", "Muscle", "Skin", "Nerve")),
row.names = c("GTEX-1117F-0003-SM-58Q7G", "GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", "GTEX-1117F-0011-R10a-SM-AHZ7F",
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class =
"data.frame")
tpm.df
#> Description GTEX-1117F-0226-SM-5GZZ7
#> ENSG00000223972.5 DDX11L1 0
#> ENSG00000227232.5 WASH7P 187
#> ENSG00000278267.1 MIR6859-1 0
#> ENSG00000243485.5 MIR1302-2HG 1
#> ENSG00000237613.2 FAM138A 0
#> ENSG00000268020.3 OR4G4P 0
metadata
#> SMATSSCR SMCENTER
#> GTEX-1117F-0003-SM-58Q7G NA B1
#> GTEX-1117F-0003-SM-5DWSB NA B1
#> GTEX-1117F-0003-SM-6WBT7 NA B1
#> GTEX-1117F-0011-R10a-SM-AHZ7F NA B1, A1
#> GTEX-1117F-0011-R10b-SM-CYKQ8 NA B1, A1
#> GTEX-1117F-0226-SM-5GZZ7 0 B1
#> SMPTHNTS
#> GTEX-1117F-0003-SM-58Q7G
#> GTEX-1117F-0003-SM-5DWSB
#> GTEX-1117F-0003-SM-6WBT7
#> GTEX-1117F-0011-R10a-SM-AHZ7F
#> GTEX-1117F-0011-R10b-SM-CYKQ8
#> GTEX-1117F-0226-SM-5GZZ7 2 pieces, ~15% vessel stroma, rep delineated
#> TISSUE
#> GTEX-1117F-0003-SM-58Q7G Adipose
#> GTEX-1117F-0003-SM-5DWSB Skin
#> GTEX-1117F-0003-SM-6WBT7 Adipose
#> GTEX-1117F-0011-R10a-SM-AHZ7F Muscle
#> GTEX-1117F-0011-R10b-SM-CYKQ8 Skin
#> GTEX-1117F-0226-SM-5GZZ7 Nerve
# One way to find samples of interest
subset_adipose_samples <- metadata[metadata$TISSUE %in% c("Adipose"),]
subset_adipose_samples
#> SMATSSCR SMCENTER SMPTHNTS TISSUE
#> GTEX-1117F-0003-SM-58Q7G NA B1 Adipose
#> GTEX-1117F-0003-SM-6WBT7 NA B1 Adipose
adipose_samples <- rownames(subset_adipose_samples)
adipose_samples
#> [1] "GTEX-1117F-0003-SM-58Q7G" "GTEX-1117F-0003-SM-6WBT7"
subset_skin_samples <- metadata[metadata$TISSUE %in% c("Skin"),]
subset_skin_samples
#> SMATSSCR SMCENTER SMPTHNTS TISSUE
#> GTEX-1117F-0003-SM-5DWSB NA B1 Skin
#> GTEX-1117F-0011-R10b-SM-CYKQ8 NA B1, A1 Skin
skin_samples <- rownames(subset_skin_samples)
skin_samples
#> [1] "GTEX-1117F-0003-SM-5DWSB" "GTEX-1117F-0011-R10b-SM-CYKQ8"
subset_tpm.df <- tpm.df[c(adipose_samples, skin_samples)]
#> Error in `[.data.frame`(tpm.df, c(adipose_samples, skin_samples)): undefined columns selected
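Created on 2022-07-19 by the reprex package (v2.0.1)
Note: this example returns an error with your sample dataset because the sample "tpm.df" has only one sample column (a Nerve sample in my made-up TISSUE column), so none of the selected adipose or skin sample names exist as columns. I am fairly confident it will work on your actual data.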
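If some samples in the metadata have no matching column in the count table (as in the example above), one possible way to avoid the "undefined columns selected" error is to keep only the sample names that both tables share. The following is a minimal sketch, not a definitive implementation: the toy objects `counts` and `meta`, the sample IDs `S1` to `S5`, and the `TISSUE` column are all made up for illustration, so substitute your own data frames and the real tissue column name.

```r
# Toy data: sample IDs and tissue labels are hypothetical.
counts <- data.frame(
  Description = c("DDX11L1", "WASH7P", "MIR6859-1"),
  S1 = c(0L, 187L, 0L),
  S2 = c(2L, 150L, 1L),
  S3 = c(0L, 90L, 0L),
  S4 = c(5L, 30L, 2L),
  row.names = c("ENSG00000223972.5", "ENSG00000227232.5", "ENSG00000278267.1")
)

# S5 is deliberately absent from the count table.
meta <- data.frame(
  TISSUE = c("Adipose", "Skin", "Muscle", "Adipose", "Skin"),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

tissues_of_interest <- c("Adipose", "Skin")

# Sample IDs in the metadata that belong to either tissue.
wanted <- rownames(meta)[meta$TISSUE %in% tissues_of_interest]

# Keep only IDs that also exist as columns in the count table, so that
# samples missing from the counts do not trigger an error.
shared <- intersect(wanted, colnames(counts))

# drop = FALSE keeps a data frame even if only one sample matches.
subset_counts <- counts[, shared, drop = FALSE]

# Metadata for the same samples, in the same order as the count columns.
subset_meta <- meta[shared, , drop = FALSE]
```

Note that this drops the `Description` column, since it is not a sample; add it back with `cbind()` if you still need the gene symbols. Keeping `subset_meta` in the same order as the columns of `subset_counts` is convenient because DGE tools generally expect the sample table and the count matrix to line up.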