Parsing venn table to create Venn Diagram in R

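I have some tables of Venn diagram values, and I am trying to read them into R and parse them so that I can plot them with the VennDiagram package. My table looks like this: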

H3K27AC.bed H3K4ME3.bed gencode.bed Total   Name
        X   19184   gencode.bed
    X       6843    H3K4ME3.bed
    X   X   3942    H3K4ME3.bed|gencode.bed
X           5097    H3K27AC.bed
X       X   1262    H3K27AC.bed|gencode.bed
X   X       4208    H3K27AC.bed|H3K4ME3.bed
X   X   X   9222    H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can read the table into a data frame like this:

> venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can get the categories for the Venn diagram from the table like this:

> venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
> venn_categories
[1] "H3K27AC.bed" "H3K4ME3.bed" "gencode.bed"

I can even make an easier-to-read summary table like this:

> venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
> venn_summary
  Total                                Name
1 19184                         gencode.bed
2  6843                         H3K4ME3.bed
3  3942             H3K4ME3.bed|gencode.bed
4  5097                         H3K27AC.bed
5  1262             H3K27AC.bed|gencode.bed
6  4208             H3K27AC.bed|H3K4ME3.bed
7  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

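What is stumping me is how to get the values from the table and assign them correctly to the regions of the Venn diagram. For reference, the triple Venn function call looks like this: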

n1<-5097
n2<-6843
n3<-19184

n12<-4208
n13<-1262
n23<-3942

n123<-9222

venn <-draw.triple.venn(area1=n1+n12+n13+n123,
                        area2=n2+n23+n12+n123,
                        area3=n3+n23+n13+n123,
                        n12=n12+n123,
                        n13=n13+n123,
                        n23=n23+n123,
                        n123=n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))
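
The sums in the call above are needed because each row of the table counts a region exclusively, while draw.triple.venn takes the full size of each set and of each intersection. As a small worked check of that arithmetic (plain base R, using only the numbers from the table):

```r
# Exclusive region counts, copied from the table
n1 <- 5097; n2 <- 6843; n3 <- 19184
n12 <- 4208; n13 <- 1262; n23 <- 3942
n123 <- 9222

# Full set sizes: every region that lies inside the set
area1 <- n1 + n12 + n13 + n123   # 19789
area2 <- n2 + n23 + n12 + n123   # 24215
area3 <- n3 + n23 + n13 + n123   # 33610

# Full pairwise intersections: the pair-only region plus the triple region
full_n12 <- n12 + n123           # 13430
full_n13 <- n13 + n123           # 10484
full_n23 <- n23 + n123           # 13164
```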

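But this obviously requires setting the values by hand, which is not ideal because I have many of these data sets and also need to scale this up to 4-way and 5-way Venns. How can I get R to find the correct value for each field in the Venn? I have tried several approaches using grep and grepl and subsetting the data frame for the rows that match each region of the diagram, but none of them worked correctly. Any suggestions? By the way, this data is output by the mergePeaks program from the HOMER package.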

I think I have solved it, using regular expressions to search the table for the correct entries for the plot. Here is the complete workflow:

# load packages
library('VennDiagram')
library('gridExtra')

# read in the venn text
venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
venn_table_df

It looks like this:

> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

> # recreate it with this btw
> dput(venn_table_df)
structure(list(H3K27AC.bed = c("", "", "", "X", "X", "X", "X"
), H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"), gencode.bed = c("X", 
"", "X", "", "X", "", "X"), Total = c(19184L, 6843L, 3942L, 5097L, 
1262L, 4208L, 9222L), Name = c("gencode.bed", "H3K4ME3.bed", 
"H3K4ME3.bed|gencode.bed", "H3K27AC.bed", "H3K27AC.bed|gencode.bed", 
"H3K27AC.bed|H3K4ME3.bed", "H3K27AC.bed|H3K4ME3.bed|gencode.bed"
)), .Names = c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed", "Total", 
"Name"), class = "data.frame", row.names = c(NA, -7L))

Then parse the table:

# get the venn categories
venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 


# make a summary table
venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
venn_summary

# get the areas for the venn; add up all the overlaps that contain the given category 

# area1
area_n1<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area2
area_n2<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# area3
area_n3<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n12
area_n12<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n13
area_n13<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])

# n23
area_n23<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


# n123
area_n123<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])


venn <-draw.triple.venn(area1=area_n1,
                        area2=area_n2,
                        area3=area_n3,
                        n12=area_n12,
                        n13=area_n13,
                        n23=area_n23,
                        n123=area_n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))

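The key is to use regular expressions to pick out only the table entries that contain all of the categories for a given Venn region. This turned out to be much more complicated than I expected, and it needs manual setup to adapt to 4-way and 5-way Venns, but so far it works. I am open to other suggestions that might simplify the process and scale up more easily.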
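
One possible way to scale this up, sketched here under assumptions rather than offered as a tested solution: skip the regular expressions and use the "X" indicator columns directly, looping over every combination of categories with combn. The helpers overlap_total and venn_totals are hypothetical names introduced for this sketch. Working from the indicator columns also avoids the "." in names such as H3K27AC.bed being treated as a regex wildcard.

```r
# Rebuild the example table (values copied from the dput() output above)
venn_table_df <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)
venn_categories <- setdiff(colnames(venn_table_df), c("Total", "Name"))

# Hypothetical helper: sum of Total over rows marked "X" in every column of `sets`
overlap_total <- function(df, sets) {
  in_all <- rowSums(df[, sets, drop = FALSE] == "X") == length(sets)
  sum(df$Total[in_all])
}

# Hypothetical helper: one total per combination of categories, named
# "area1", "area2", ... for single sets and "n12", "n123", ... for overlaps
venn_totals <- function(df, categories) {
  totals <- c()
  for (k in seq_along(categories)) {
    prefix <- if (k == 1) "area" else "n"
    for (idx in combn(seq_along(categories), k, simplify = FALSE)) {
      key <- paste0(prefix, paste(idx, collapse = ""))
      totals[key] <- overlap_total(df, categories[idx])
    }
  }
  totals
}

totals <- venn_totals(venn_table_df, venn_categories)
totals
```

The names are chosen to match the argument names of draw.triple.venn, so something like do.call(draw.triple.venn, c(as.list(totals), list(category = venn_categories))) should work for three sets. The same idea would presumably carry over to draw.quad.venn and draw.quintuple.venn, but I have not checked those against real 4-way or 5-way mergePeaks output.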

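In case anyone finds this useful, there is now a very simple way to turn these numbers into a quasi-proportional Venn diagram. One of the ways to create a diagram with the nVennR package is to build it from scratch. As explained in the vignette, the value for each region is entered in a specific order, which happens to be the same as the order in the table. The only difference is that nVennR expects one more value at the beginning, corresponding to the region outside all sets (this value should be 0, but it is ignored anyway). This makes the process very easy: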

> vt <- read.table('clipboard', header = T)
> vt
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1           0           0           X 19184                         gencode.bed
2           0           X           0  6843                         H3K4ME3.bed
3           0           X           X  3942             H3K4ME3.bed|gencode.bed
4           X           0           0  5097                         H3K27AC.bed
5           X           0           X  1262             H3K27AC.bed|gencode.bed
6           X           X           0  4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed
> myV <- createVennObj(nSets = 3, sNames = c('H3K27Ac', 'H3K4ME3', 'gencode'), sSizes = c(0, vt$Total))
> vp <- plotVenn(nVennObj = myV)

Another advantage of this procedure is that it scales to more sets.
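
Since this relies on the table rows being in the expected order, it may be worth checking that before passing vt$Total along. In the table here the rows follow binary counting, with the first set as the most significant bit (001, 010, 011, ...). The sketch below, in base R only, derives a region code for each row from the "X" columns and builds the size vector from those codes. It assumes that binary ordering, and it assumes that a region missing from the table should get size 0.

```r
# Example table (values copied from above; "X" marks membership)
vt <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)
sets <- setdiff(colnames(vt), c("Total", "Name"))
n_sets <- length(sets)

# Region code per row: the first set is the most significant bit
member <- (vt[, sets, drop = FALSE] == "X") * 1
weights <- 2^((n_sets - 1):0)
region <- as.vector(member %*% weights)

# Sizes for all 2^n regions, outer region (code 0) first;
# regions that do not appear in the table keep size 0
s_sizes <- numeric(2^n_sets)
s_sizes[region + 1] <- vt$Total
s_sizes
```

For this table, s_sizes is the same as c(0, vt$Total), so it could be passed as sSizes to createVennObj in the same way.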
