I have a data frame:
name<-c('a','b','c','d','e')
type<-c('x','x','y','x','y')
chr<- c('ch1','ch1','ch1','ch2','ch2')
pos<- c(5000, 5100, 4999,5500,5100)
df<-data.frame(name,type, chr,pos)
I would like to go through each row and, if type is NOT equal AND chr IS equal AND the absolute difference in pos is within 100, create a new df containing the matches (with a new column holding the name of the match). For the above df, rows 1 and 3 would match and the result would be
Ideally I don't want the reciprocal match and so I would want
If it is easier I can split into two dfs according to type.
I have tried a variation of merge and filter (dplyr) but am getting nowhere.
We can split the data based on the type column, do a full_join by the 'chr' column, and filter the rows where the absolute difference between the pos columns is less than 100.
library(dplyr)
df %>%
group_split(type) %>%
purrr::reduce(full_join, by = 'chr') %>%
filter(abs(pos.x - pos.y) < 100)
# A tibble: 1 x 7
# name.x type.x chr pos.x name.y type.y pos.y
# <fct> <fct> <fct> <dbl> <fct> <fct> <dbl>
#1 a x ch1 5000 c y 4999
You can then remove any unwanted columns and rename them as per your requirement.
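As a rough sketch of that clean-up step, the joined result could be reduced to the layout asked for in the question. This assumes exactly two types as in the example data, so that the .x columns come from type x and the .y columns from type y. The names joined and tidy_matches are only illustrative.

```r
library(dplyr)

# Rebuild the example data; stringsAsFactors = FALSE keeps the columns as plain character
df <- data.frame(
  name = c("a", "b", "c", "d", "e"),
  type = c("x", "x", "y", "x", "y"),
  chr  = c("ch1", "ch1", "ch1", "ch2", "ch2"),
  pos  = c(5000, 5100, 4999, 5500, 5100),
  stringsAsFactors = FALSE
)

# Same split / join / filter steps as above
joined <- df %>%
  group_split(type) %>%
  purrr::reduce(full_join, by = "chr") %>%
  filter(abs(pos.x - pos.y) < 100)

# Keep the left-hand row's columns and record the partner's name as `match`
tidy_matches <- joined %>%
  transmute(name = name.x, type = type.x, chr, pos = pos.x, match = name.y)
```

Because type x always ends up on the left-hand side of the join, each pair appears only once, so no reciprocal rows need to be removed afterwards.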
I'm sure others here will come up with something more elegant than row slicing, but this seems to work if you want a full data frame of all the matches:
library(tidyverse)
find_matches <- function(i) {
row_of_interest <- df[i, ]
df_rest <- df[-i, ]
names(df_rest) <- str_c(names(df_rest), ".x")
df_rest %>%
cbind(row_of_interest) %>%
# chr == chr.x enforces the same-chr condition from the question
filter(type != type.x, chr == chr.x, abs(pos - pos.x) < 100) %>%
transmute(name, type, chr, pos, match = name.x)
}
map_dfr(seq_len(nrow(df)), find_matches)
name type chr pos match
1 a x ch1 5000 c
2 c y ch1 4999 a
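A table of all matches lists each pair twice, once from each side, whereas the question asks for each pair only once. One possible way to drop the reciprocal rows, sketched here in base R on a small hand-built table (matches and one_way are illustrative names, not part of the answer above), is to keep a row only when name sorts before match:

```r
# Hand-built illustration of an all-matches table containing a reciprocal pair
matches <- data.frame(
  name  = c("a", "c"),
  match = c("c", "a"),
  stringsAsFactors = FALSE  # plain character columns so `<` compares strings
)

# Keep a pair only when `name` sorts before `match`:
# the a/c row survives and its mirror image c/a is dropped
one_way <- matches[matches$name < matches$match, ]
```

This relies on the names being unique and comparable as strings. If the name columns are factors, they would need to be converted with as.character() first.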
A data.table option using a non-equi join, which should be faster for a large dataset:
library(data.table)
setDT(df)[, c("s", "e") := .(pos - 100, pos + 100)]
#perform non-equi join based on desired conditions
pair <- df[df, on=.(chr, s<=pos, e>=pos), nomatch=0L,
.(name=i.name[x.type!=i.type], match=x.name[x.type!=i.type])]
#extract rows with matches while removing reciprocals
df[unique(pair[, .(name=pmin(name, match), match=pmax(name, match))]), on=.(name)]
output:
name type chr pos s e match
1: a x ch1 5000 4900 5100 c