
How to find the number of cells in a column that match another data frame's range?

I have a data.frame1 like:

Input_SNP_CHR   Input_SNP_BP     Set_1_CHR   Set_1_BP     Set_2_CHR   Set_2_BP     Set_3_CHR   Set_3_BP
    chr4         184648954        chr18      63760782       chr7      135798891      chr7        91206783  
    chr13        45801432         chr14      52254555       chr1      223293324      chr4        184648954
    chr18        71883393         chr22      50428069       chr7      138698825      chr18       63760782

I have another data.frame2 like:

CHR     BP1             BP2             Score   Value
chr1    29123222        29454711        -5.7648 599
chr13   45799118        45986770        -4.8403 473
chr5    46327104        46490961        -5.3036 536
chr6    50780759        51008404        -4.4165 415
chr18   63634657        63864734        -4.8096 469
chr1    77825305        78062178        -5.4671 559

I would like to find out, for each pair of columns in data.frame1 (a pair is a CHR column together with its BP column, e.g. Input_SNP_CHR and Input_SNP_BP), how many rows BOTH match a CHR in data.frame2 and have a BP that falls between BP1 and BP2 of that same row. For example, my first pair (the Input_SNP pair) has one match: its second row, where the CHR (chr13) and BP (45801432) match the row of data.frame2 with CHR chr13 and a BP range of 45799118 to 45986770. My second pair (the Set_1 pair) also has one match: chr18 with BP 63760782 matches the 5th row of data.frame2 by both chr18 and the BP range.

My desired output would be:

Input_SNP     Set_1     Set_2     Set_3
1             1         0         1

How would I go about doing this in R?

Here's another possible solution using data.table. First we melt the data to a long format, add a Set column based on the column names of the first data frame, and then run foverlaps combined with table to count the matches per set.

library(data.table) # v 1.9.6+
Ldf <- melt(setDT(df), measure = patterns("CHR", "BP")) # Stack the CHR and BP columns into long format
Names <- unique(sub("(.*_.*)_.*", "\\1", names(df))) # Set names, e.g. "Input_SNP", "Set_1"
setnames(Ldf[, variable := factor(Names[variable])], c("Set", "CHR", "BP1")) # Label each row with its set, then rename
Ldf[, BP2 := BP1] # Right boundary for foverlaps (each position becomes a zero-width interval)
setkeyv(Ldf, names(Ldf)[-1]) # Key on CHR, BP1, BP2 for foverlaps
table(foverlaps(setDT(df2), Ldf, nomatch = 0L)$Set) # Run foverlaps and count matches per set
# Input_SNP     Set_1     Set_2     Set_3 
#         1         1         0         1 
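
As a variation on the approach above, the same counts can probably be obtained with a non-equi join instead of foverlaps. This is only a sketch under some assumptions: it needs a data.table version that supports non-equi joins (1.9.8 or later), it rebuilds the sample data from the question so that it runs on its own, and the names long, sets, hits and counts are made up for illustration.

```r
library(data.table)

# Sample data copied from the question
df <- data.table(
  Input_SNP_CHR = c("chr4", "chr13", "chr18"),
  Input_SNP_BP  = c(184648954, 45801432, 71883393),
  Set_1_CHR     = c("chr18", "chr14", "chr22"),
  Set_1_BP      = c(63760782, 52254555, 50428069),
  Set_2_CHR     = c("chr7", "chr1", "chr7"),
  Set_2_BP      = c(135798891, 223293324, 138698825),
  Set_3_CHR     = c("chr7", "chr4", "chr18"),
  Set_3_BP      = c(91206783, 184648954, 63760782)
)
df2 <- data.table(
  CHR = c("chr1", "chr13", "chr5", "chr6", "chr18", "chr1"),
  BP1 = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
  BP2 = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178)
)

# Long format: one row per (set, CHR, BP) position
long <- melt(df, measure.vars = patterns("CHR$", "BP$"),
             value.name = c("CHR", "BP"))
sets <- unique(sub("_(CHR|BP)$", "", names(df)))
long[, Set := factor(sets[as.integer(variable)], levels = sets)]

# Keep positions whose CHR matches and whose BP lies within [BP1, BP2]
hits <- long[df2, on = .(CHR, BP >= BP1, BP <= BP2), nomatch = 0L]

# Set is a factor, so sets without any hit still show up with a count of 0
counts <- table(hits$Set)
print(counts)
```

Unlike foverlaps, this does not need a key or a duplicated BP2 column on the long table, which may be easier to read for a simple point-in-interval check.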

I think the data in your first data.frame should be formatted like this:

#       CHR      type        BP
# 1.1  chr4 Input_SNP 184648954
# 1.2 chr13 Input_SNP  45801432
# 1.3 chr18 Input_SNP  71883393
# 2.1 chr18     Set_1  63760782
# 2.2 chr14     Set_1  52254555
# 2.3 chr22     Set_1  50428069
# 3.1  chr7     Set_2 135798891
# 3.2  chr1     Set_2 223293324
# 3.3  chr7     Set_2 138698825
# 4.1  chr7     Set_3  91206783
# 4.2  chr4     Set_3 184648954
# 4.3 chr18     Set_3  63760782

(Row names are not important though.)

Ideally you would generate the data in that shape from the start, but if you already have it in the format you provided, you can transform it as follows (assuming your first data.frame is named df):

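# Derive the pair name ("Input_SNP", "Set_1", ...) from each column name
type_list=lapply(strsplit(colnames(df),"_"),
                 function(x) paste0(x[1],"_",x[2]))

# Stack each (CHR, BP) column pair into one long data.frame
df_new=do.call("rbind",
               lapply(split(seq_len(ncol(df)),rep(seq_len(ncol(df)/2),each=2)),
                      function(idxs) {
                        data.frame(CHR=df[,idxs[1]],
                                   type=type_list[[idxs[1]]],
                                   BP=df[,idxs[2]])}))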

Then it takes just two lines of base R to accomplish your task (assuming the second data.frame is named df2):

df_new_2=within(merge(df_new,df2,by="CHR"),
                cnt<-BP>=BP1&BP<=BP2)

sapply(split(df_new_2,df_new_2$type),function(x) sum(x$cnt))
#Input_SNP     Set_1     Set_2     Set_3 
#        1         1         0         1 

(I only get one hit for Set_3, as only chr18 matches.)
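
If reshaping is not wanted, the check can also be sketched directly on the wide data in base R. The snippet below is a minimal, self-contained illustration rather than a definitive implementation: it rebuilds the sample data from the question, and count_hits, pairs and counts are hypothetical names introduced here. It assumes every pair is stored as two columns named <pair>_CHR and <pair>_BP.

```r
# Sample data copied from the question
df <- data.frame(
  Input_SNP_CHR = c("chr4", "chr13", "chr18"),
  Input_SNP_BP  = c(184648954, 45801432, 71883393),
  Set_1_CHR     = c("chr18", "chr14", "chr22"),
  Set_1_BP      = c(63760782, 52254555, 50428069),
  Set_2_CHR     = c("chr7", "chr1", "chr7"),
  Set_2_BP      = c(135798891, 223293324, 138698825),
  Set_3_CHR     = c("chr7", "chr4", "chr18"),
  Set_3_BP      = c(91206783, 184648954, 63760782),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  CHR = c("chr1", "chr13", "chr5", "chr6", "chr18", "chr1"),
  BP1 = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
  BP2 = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
  stringsAsFactors = FALSE
)

# Count the positions (chr[i], bp[i]) that fall inside at least one range
count_hits <- function(chr, bp, ranges) {
  inside <- vapply(seq_along(chr), function(i) {
    any(ranges$CHR == chr[i] & ranges$BP1 <= bp[i] & bp[i] <= ranges$BP2)
  }, logical(1))
  sum(inside)
}

# Pair names are the column names without their _CHR / _BP suffix
pairs <- unique(sub("_(CHR|BP)$", "", names(df)))
counts <- sapply(pairs, function(p) {
  count_hits(df[[paste0(p, "_CHR")]], df[[paste0(p, "_BP")]], df2)
})
print(counts)
```

One design difference is worth noting: this version counts each row of the first data.frame at most once, whereas the merge-based code counts one hit per matching (row, range) combination. The two agree on the sample data, but they would differ if ranges in df2 overlapped on the same chromosome.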
