I have a data.frame1 like:
Input_SNP_CHR Input_SNP_BP Set_1_CHR Set_1_BP Set_2_CHR Set_2_BP Set_3_CHR Set_3_BP
chr4 184648954 chr18 63760782 chr7 135798891 chr7 91206783
chr13 45801432 chr14 52254555 chr1 223293324 chr4 184648954
chr18 71883393 chr22 50428069 chr7 138698825 chr18 63760782
I have another data.frame2 like:
CHR BP1 BP2 Score Value
chr1 29123222 29454711 -5.7648 599
chr13 45799118 45986770 -4.8403 473
chr5 46327104 46490961 -5.3036 536
chr6 50780759 51008404 -4.4165 415
chr18 63634657 63864734 -4.8096 469
chr1 77825305 78062178 -5.4671 559
I would like to find out, for each pair of columns in data.frame1 (a pair is, for example, Input_SNP_CHR and Input_SNP_BP together), how many rows BOTH match a CHR in data.frame2 and have a BP that falls between BP1 and BP2 of that same data.frame2 row. For example, in my first pair (the Input_SNP pair) I have one match. This is the second row, where both the CHR (chr13) and BP (45801432) of Input_SNP match a row by CHR (chr13) and BP range (between 45799118 and 45986770) in data.frame2. For my second pair (the Set_1 pair) I also have one match: chr18 with BP 63760782 matches the 5th row of data.frame2 by CHR (chr18) and BP range (between 63634657 and 63864734).
My desired output would be:
Input_SNP Set_1 Set_2 Set_3
1 1 0 1
How would I go about doing this in R?
Here's another possible solution using data.table. First we melt the data to a long format, add a Set column according to the column names of the first data frame (df), and then run foverlaps combined with table in order to check frequencies.
library(data.table) # v1.9.6+
Ldf <- melt(setDT(df), measure = patterns("CHR", "BP")) # Long format: one CHR column and one BP column
Names <- unique(sub("(.*_.*)_.*", "\\1", names(df))) # Set names, in column order
setnames(Ldf[, variable := factor(Names[variable])], c("Set", "CHR", "BP1")) # Label rows by set, then rename
Ldf[, BP2 := BP1] # Right boundary for foverlaps (each BP is a point interval)
setkeyv(Ldf, names(Ldf)[-1]) # Key on CHR, BP1, BP2 for foverlaps
table(foverlaps(setDT(df2), Ldf, nomatch = 0L)$Set) # Run foverlaps and tabulate matches per set
# Input_SNP Set_1 Set_2 Set_3
# 1 1 0 1
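As a small illustration of the Names step above, the following sketch applies the same regular expression to the column names from the question. The vector name cols and the result name set_names are only illustrative and are not part of the answer's code.

```r
# Column names copied from the question's first data frame
cols <- c("Input_SNP_CHR", "Input_SNP_BP", "Set_1_CHR", "Set_1_BP",
          "Set_2_CHR", "Set_2_BP", "Set_3_CHR", "Set_3_BP")

# Keep everything up to (but not including) the last underscore-delimited part,
# then drop the duplicates that the CHR/BP pairs produce
set_names <- unique(sub("(.*_.*)_.*", "\\1", cols))
print(set_names)
```

Each CHR/BP pair collapses to a single set label, which is what the answer then uses to relabel the variable column produced by melt.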
I think the data in your first data.frame should be formatted like this:
# CHR type BP
# 1.1 chr4 Input_SNP 184648954
# 1.2 chr13 Input_SNP 45801432
# 1.3 chr18 Input_SNP 71883393
# 2.1 chr18 Set_1 63760782
# 2.2 chr14 Set_1 52254555
# 2.3 chr22 Set_1 50428069
# 3.1 chr7 Set_2 135798891
# 3.2 chr1 Set_2 223293324
# 3.3 chr7 Set_2 138698825
# 4.1 chr7 Set_3 91206783
# 4.2 chr4 Set_3 184648954
# 4.3 chr18 Set_3 63760782
(Row names are not important though.)
Ideally you would generate the data like that, but if you already have it in the format you provided, you can transform it as follows (assuming the name of your first data.frame is df):
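# Derive the set label ("Input_SNP", "Set_1", ...) for every column
type_list <- lapply(strsplit(colnames(df), "_"),
                    function(x) paste0(x[1], "_", x[2]))
# Stack each CHR/BP column pair into one long data.frame
df_new <- do.call("rbind",
  lapply(split(1:ncol(df), sort(rep(1:(ncol(df)/2), times = 2))),
         function(idxs) {
           data.frame(CHR = df[, idxs[1]],
                      type = type_list[[idxs[1]]],
                      BP = df[, idxs[2]])
         }))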
Then it's just two lines of base R to accomplish your task (assuming the second data.frame is df2):
df_new_2 <- within(merge(df_new, df2, by = "CHR"),
                   cnt <- BP >= BP1 & BP <= BP2)
sapply(split(df_new_2, df_new_2$type), function(x) sum(x$cnt))
#Input_SNP Set_1 Set_2 Set_3
# 1 1 0 1
(I only get one hit for Set_3 as only chr18 matches.)
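For comparison, here is a minimal base R sketch that runs the same check directly on the wide layout, without reshaping first. It is not taken from either answer: the helper count_hits and the names sets and counts are hypothetical, and the sample data is copied from the question.

```r
# Sample data copied from the question
df <- read.table(text = "Input_SNP_CHR Input_SNP_BP Set_1_CHR Set_1_BP Set_2_CHR Set_2_BP Set_3_CHR Set_3_BP
chr4 184648954 chr18 63760782 chr7 135798891 chr7 91206783
chr13 45801432 chr14 52254555 chr1 223293324 chr4 184648954
chr18 71883393 chr22 50428069 chr7 138698825 chr18 63760782",
  header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "CHR BP1 BP2 Score Value
chr1 29123222 29454711 -5.7648 599
chr13 45799118 45986770 -4.8403 473
chr5 46327104 46490961 -5.3036 536
chr6 50780759 51008404 -4.4165 415
chr18 63634657 63864734 -4.8096 469
chr1 77825305 78062178 -5.4671 559",
  header = TRUE, stringsAsFactors = FALSE)

# Set labels: strip the trailing _CHR / _BP from the column names
sets <- unique(sub("_(CHR|BP)$", "", names(df)))

# Hypothetical helper: number of rows in one set whose (CHR, BP) pair
# falls inside at least one interval of df2 on the same chromosome
count_hits <- function(set) {
  chr <- df[[paste0(set, "_CHR")]]
  bp  <- df[[paste0(set, "_BP")]]
  hit <- vapply(seq_along(chr), function(i) {
    any(df2$CHR == chr[i] & df2$BP1 <= bp[i] & bp[i] <= df2$BP2)
  }, logical(1))
  sum(hit)
}

counts <- vapply(sets, count_hits, integer(1))
print(counts)
```

One difference from the merge approach is worth noting: this sketch counts each row of df at most once per set, even if its position falls inside several overlapping intervals of df2, whereas the merge counts every matching row/interval combination. On the sample data the two agree.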