简体   繁体   中英

compare multiple columns in one or two data frames r

I have a data frame:

name<-c('a','b','c','d','e')
type<-c('x','x','y','x','y')
chr<- c('ch1','ch1','ch1','ch2','ch2')
pos<- c(5000, 5100, 4999,5500,5100)
df<-data.frame(name,type, chr,pos)

df
I would like to go through each row and if type is NOT equal, AND chr IS equal, AND pos is within abs(100) then create a new df with the matches (with a new column of the name of the match). For the above df rows 1 and 3 would match and the result would be

在此处输入图片说明

Ideally I don't want the reciprocal match and so I would want

在此处输入图片说明

If it is easier I can split into two dfs according to type.

I have tried a variation of merge and filter (dplyr) but am getting nowhere.

We can split the data based on type column, do a full_join by 'chr' column and filter the rows where absolute values between pos columns is less than 100.

library(dplyr)

df %>%
  group_split(type) %>%
  purrr::reduce(full_join, by = 'chr') %>%
  filter(abs(pos.x - pos.y) < 100)

# A tibble: 1 x 7
#  name.x type.x chr   pos.x name.y type.y pos.y
#  <fct>  <fct>  <fct> <dbl> <fct>  <fct>  <dbl>
#1  a      x      ch1    5000 c      y       4999

You can then remove any unwanted columns and rename them as per your requirement.

I'm sure others here will come up with something more elegant than row slicing, but this seems to work if you a full data frame of all the matches:

library(tidyverse)

find_matches <- function(i) {
  row_of_interest <- df[i, ]
  df_rest <- df[-i, ]
  names(df_rest) <- str_c(names(df_rest), ".x")

  df_rest %>% 
    cbind(row_of_interest) %>% 
    filter(type != type.x, abs(pos - pos.x) < 100) %>% 
    transmute(name, type, chr, pos, match = name.x)
}

map_dfr(1:5, find_matches)

  name type chr  pos match
1    a    x ch1 5000     c
2    b    x ch1 5100     e
3    c    y ch1 4999     a
4    e    y ch2 5100     b

An data.table option using a non-equi join, which should be faster for a large dataset:

library(data.table)
setDT(df)[, c("s", "e") := .(pos - 100, pos + 100)]

#perform non-equi join based on desired conditions
pair <- df[df, on=.(chr, s<=pos, e>=pos), nomatch=0L,
    .(name=i.name[x.type!=i.type], match=x.name[x.type!=i.type])]

#extract rows with matches while removing reciprocals
df[unique(pair[, .(name=pmin(name, match), match=pmax(name, match))]), on=.(name)]

output:

   name type chr  pos    s    e match
1:    a    x ch1 5000 4900 5100     c

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM