How to filter and subset a dataframe based on another dataframe in R
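Although there are many questions similar to this one, I have not been able to find an answer specific to this problem in R, so I'm not sure where to start. I have 2 datasets:
Data1: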
Chr Start End rssnp1 Type gene
1 1244733 1244734 rs2286773 LD_SNP ACE
1 1257536 1257436 rs301159 LD_SNP CPEB4
1 1252336 1252336 rs2286773 Sentinel CPEB4
1 1252343 1252343 rs301159 LD_SNP CPEB4
1 1254841 1254841 rs301159 LD_SNP CPEB4
1 1256703 1267404 rs301159 LD_SNP CPEB4
1 1269246 1269246 rs301159 LD_SNP CPEB4
1 1370168 1370168 rs301159 LD_SNP GLUPA1
1 1371824 1371824 rs301159 LD_SNP GLUPA1
1 1372591 1372591 rs301159 LD_SNP GLUPA1
Data2:
gene
CPEB4
GML
TBX2
PNKD
JMJD1C
SKI
MYH11
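Data2 is the output of a machine learning step (genes that have been classified as affecting a disease).
I want to pick a gene from Data2, find it in Data1, specifically find the row for that gene whose Type column is 'Sentinel', and then filter Data1 by the rssnp1 value of that sentinel row.
For example, if I search Data1 for the gene CPEB4 and find that its Sentinel row has rssnp1 rs2286773, filtering on that value should give the output: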
Chr Start End rssnp1 Type gene
1 1243933 1243934 rs2286773 LD_SNP ACAP3
1 1254436 1254436 rs2286773 Sentinel CPEB4
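So far I have looked at using merge, filter() and subset(), but since I have several steps, should I be trying to use them inside a for loop? Is there a better function for this?
I am new to R, so I have not got very far. For example, I tried merging the datasets: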
merged <- merge(data1, data2, by = 'gene', all = TRUE)
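I could then filter the result manually in Excel, but ideally I would like to automate this further, so any advice or pointers in the right direction would be much appreciated.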
One option is to group by 'rssnp1' after the full_join and filter the groups that have any value equal to "Sentinel" in 'Type':
library(dplyr)
full_join(data1, data2, by = 'gene') %>%
group_by(rssnp1) %>%
filter(any(Type == "Sentinel")) #or
#filter("Sentinel" %in% Type)
# A tibble: 2 x 6
# Groups: rssnp1 [1]
# Chr Start End rssnp1 Type gene
# <int> <int> <int> <chr> <chr> <chr>
#1 1 1244733 1244734 rs2286773 LD_SNP ACE
#2 1 1252336 1252336 rs2286773 Sentinel CPEB4
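Note that the pipeline above keeps every rssnp1 group containing a Sentinel row, whichever gene that row belongs to. If only sentinels of genes listed in data2 should count, one possible variation is two semi_join() calls. This is a sketch and not part of the original answer: the names `sentinels` and `result` are made up here, and the data frames are trimmed-down copies of the question's data.

```r
library(dplyr)

# Trimmed copies of the question's data (illustrative subset only)
data1 <- data.frame(
  Chr    = 1L,
  Start  = c(1244733L, 1257536L, 1252336L, 1370168L),
  End    = c(1244734L, 1257436L, 1252336L, 1370168L),
  rssnp1 = c("rs2286773", "rs301159", "rs2286773", "rs301159"),
  Type   = c("LD_SNP", "LD_SNP", "Sentinel", "LD_SNP"),
  gene   = c("ACE", "CPEB4", "CPEB4", "GLUPA1"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(gene = c("CPEB4", "GML", "TBX2"), stringsAsFactors = FALSE)

# Sentinel rows whose gene appears in data2
sentinels <- data1 %>%
  filter(Type == "Sentinel") %>%
  semi_join(data2, by = "gene")

# All rows of data1 that share an rssnp1 with one of those sentinels
result <- data1 %>% semi_join(sentinels, by = "rssnp1")
result
```

Because semi_join() only filters rows and never adds columns, the result has the same columns as data1 and no NA rows for genes that are absent from it.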
Or, starting from the OP's code, the merged result can be extended further with ave:
i1 <- with(merged, ave(Type %in% "Sentinel", rssnp1, FUN = any))
merged[i1,]
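To see what the ave call is doing, here is a small sketch on toy vectors. The vectors and the names `is_sentinel` and `keep` are invented for illustration; they stand in for merged$Type and merged$rssnp1, including an NA row like the ones merge(..., all = TRUE) produces for genes missing from data1.

```r
# Toy vectors standing in for merged$Type and merged$rssnp1
Type   <- c("LD_SNP", "Sentinel", "LD_SNP", "LD_SNP", NA)
rssnp1 <- c("rsA", "rsA", "rsB", "rsB", NA)

# %in% never returns NA, so the NA row becomes FALSE
is_sentinel <- Type %in% "Sentinel"

# Apply any() within each rssnp1 group and spread the answer back over the rows
keep <- ave(is_sentinel, rssnp1, FUN = any)
keep
```

Every row of a group that contains a sentinel is flagged TRUE, which is why the resulting vector can be used directly as a row index.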
# Data used in the examples above (dput output)
data1 <- structure(list(Chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), Start = c(1244733L, 1257536L, 1252336L, 1252343L, 1254841L,
1256703L, 1269246L, 1370168L, 1371824L, 1372591L), End = c(1244734L,
1257436L, 1252336L, 1252343L, 1254841L, 1267404L, 1269246L, 1370168L,
1371824L, 1372591L), rssnp1 = c("rs2286773", "rs301159", "rs2286773",
"rs301159", "rs301159", "rs301159", "rs301159", "rs301159", "rs301159",
"rs301159"), Type = c("LD_SNP", "LD_SNP", "Sentinel", "LD_SNP",
"LD_SNP", "LD_SNP", "LD_SNP", "LD_SNP", "LD_SNP", "LD_SNP"),
gene = c("ACE", "CPEB4", "CPEB4", "CPEB4", "CPEB4", "CPEB4",
"CPEB4", "GLUPA1", "GLUPA1", "GLUPA1")),
class = "data.frame", row.names = c(NA,
-10L))
data2 <- structure(list(gene = c("CPEB4", "GML", "TBX2", "PNKD", "JMJD1C",
"SKI", "MYH11")), class = "data.frame", row.names = c(NA, -7L
))
I assume you want to pass each gene from data2 and get its corresponding rows from data1. Hopefully the following code helps.
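library(dplyr)
# Return every row of data1 that shares its rssnp1 with the Sentinel row of the given gene
getFromData1 <- function(geneFromData2 = NULL) {
  if (is.null(geneFromData2)) return(NULL)
  # rssnp1 value(s) of the Sentinel row(s) for this gene
  geneSentinelSNP <- data1 %>%
    filter(Type == "Sentinel", gene == geneFromData2) %>%
    pull(rssnp1)
  # %in% rather than == so genes with zero or several Sentinel rows do not raise an error
  data1 %>% filter(rssnp1 %in% geneSentinelSNP)
}
getFromData1(geneFromData2 = "CPEB4")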
You can also call the getFromData1 function with lapply, which gives you a list of data frames, one for each gene in data2.
lapply(data2$gene, getFromData1)
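The list returned by lapply is unnamed, so it can be hard to tell which element belongs to which gene. One way to handle that, sketched here under a few assumptions, is to name the elements and then stack them with bind_rows(). The helper `rowsForGene`, the column name `query_gene`, and the shortened data frames are all illustrative stand-ins rather than part of the original answer.

```r
library(dplyr)

# Shortened, illustrative versions of the question's data
data1 <- data.frame(
  rssnp1 = c("rs2286773", "rs301159", "rs2286773", "rs301159"),
  Type   = c("LD_SNP", "LD_SNP", "Sentinel", "LD_SNP"),
  gene   = c("ACE", "CPEB4", "CPEB4", "GLUPA1"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(gene = c("CPEB4", "GML"), stringsAsFactors = FALSE)

# Stand-in for getFromData1(): rows sharing an rssnp1 with the gene's Sentinel row
rowsForGene <- function(g) {
  snp <- data1$rssnp1[data1$Type == "Sentinel" & data1$gene == g]
  data1[data1$rssnp1 %in% snp, ]
}

# Name each list element after the gene that was looked up
perGene <- setNames(lapply(data2$gene, rowsForGene), data2$gene)

# Stack into one data frame; query_gene records which lookup produced each row
combined <- bind_rows(perGene, .id = "query_gene")
combined
```

Genes with no Sentinel row in data1 (GML here) contribute an empty data frame to the list and therefore no rows to the combined result.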