简体   繁体   中英

filter items from one column based on shared items in another R

I have a table were each sample has a unique identifier but also a section identifier. I want to extract all vs all distance comparisons for each section (this data comes from a second table)

eg table 1

Sample    Section
1         1
2         1
3         1
4         2
5         2
6         3

table 2

sample    sample    distance
1         2         10
1         3         1
1         4         2
2         3         5
2         4         10
3         4         11

so my desired output is a list which has distance for: [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5] - ie all distance comparisons from table two for samples which share a section in table 1

I started trying to do this with nested for loops, but it quickly got messy.. Any ideas of a neat way to do this?

A solution using .

We can first create a data frame showing the combination of samples in each section.

library(dplyr)

table1_cross <- full_join(table1, table1, by = "Section") %>%    # Full join by Section
  filter(Sample.x != Sample.y) %>%                               # Remove records with same samples
  rowwise() %>%
  mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
  ungroup() %>%
  distinct(Sample.all, .keep_all = TRUE) %>%                     # Remove duplicates in Sample.all
  select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
#   Sample1 Sample2 Section
#     <int>   <int>   <int>
# 1       1       2       1
# 2       1       3       1
# 3       2       3       1
# 4       4       5       2

We can then filter table2 by table1_cross . table3 is the final output.

table3 <- table2 %>%                                     
  semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_corss

table3
#   Sample1 Sample2 distance
# 1       1       2       10
# 2       1       3        1
# 3       2       3        5

DATA

table1 <- read.table(text = "Sample    Section
1         1
                     2         1
                     3         1
                     4         2
                     5         2
                     6         3",
                     header = TRUE, stringsAsFactors = FALSE)

table2 <- read.table(text = "Sample1    Sample2    distance
1         2         10
                     1         3         1
                     1         4         2
                     2         3         5
                     2         4         10
                     3         4         11",
                     header = TRUE, stringsAsFactors = FALSE)

The OP has requested to find all distance comparisons from table2 for samples which share a section in table1 .

This can be achieved by two different approaches:

  1. Look up the respective section ids for Sample1 and Sample2 each in table1 and keep only those rows of table2 where the section ids match.
  2. Create all unique combinations of sample ids for each section in table1 and find the appropriate entries in table2 (if any).

Approach 1

Base R

tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]
  Sample2 Sample1 distance 1 2 1 10 2 3 1 1 3 3 2 5 

dplyr

library(dplyr)
table2 %>% 
  inner_join(table1, by = c(Sample1 = "Sample")) %>% 
  inner_join(table1, by = c(Sample2 = "Sample")) %>% 
  filter(Section.x == Section.y) %>% 
  select(-Section.x, -Section.y)
  Sample1 Sample2 distance 1 1 2 10 2 1 3 1 3 2 3 5 

data.table

Using nested joins

library(data.table)
tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
table1[tmp, on = .(Sample == Sample2)][
  Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]

using merge() and chained data.table expressions

tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
  Section.x == Section.y, -c("Section.x", "Section.y")]
  Sample2 Sample1 distance 1: 2 1 10 2: 3 1 1 3: 3 2 5 

Approach 2

Base R

table1_cross <- do.call(rbind, lst <- lapply(
  split(table1, table1$Section), 
  function(x) as.data.frame(combinat::combn2(x$Sample))))
merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))

Here, the handy combn2(x) function is used which generates all combinations of the elements of x taken two at a time, eg,

combinat::combn2(1:3)
  [,1] [,2] [1,] 1 2 [2,] 1 3 [3,] 2 3 

The tedious part is to apply combn2() to each group of Section separately and to create a data.frame which can be merged, finally.

dplyr

This is a streamlined version of www's approach

full_join(table1, table1, by = "Section") %>%
  filter(Sample.x < Sample.y) %>% 
  semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))

Non-equi self join

library(data.table)
setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
              on = .(Sample1, Sample2), nomatch = 0L]
  Sample1 Sample2 distance Section 1: 1 2 10 1 2: 1 3 1 1 3: 2 3 5 1 

Here, a non-equi join is used to create the unique combinations of Sample for each Section . This is equivalent to using combn2() :

setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)]
  Section Sample1 Sample2 1: 1 NA 1 2: 1 1 2 3: 1 1 3 4: 1 2 3 5: 2 NA 4 6: 2 4 5 7: 3 NA 6 

The NA rows will be removed in the final join.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM