Filter items from one column based on shared items in another column

Question

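I have a table in which each sample has a unique identifier and a section identifier. I want to extract all pairwise distance comparisons within each section (the distance data come from a second table).

For example, Table 1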

Sample    Section
1         1
2         1
3         1
4         2
5         2
6         3

Table 2

sample    sample    distance
1         2         10
1         3         1
1         4         2
2         3         5
2         4         10
3         4         11

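So my desired output is a list of the distances for [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5], i.e. every distance comparison in Table 2 between samples that share a section in Table 1.

I started trying to do this with nested for loops, but it quickly got messy.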

Answer 1

A solution using dplyr.

First, we can create a data frame showing the combinations of samples within each section.

library(dplyr)

table1_cross <- full_join(table1, table1, by = "Section") %>%    # Full join by Section
  filter(Sample.x != Sample.y) %>%                               # Remove records with same samples
  rowwise() %>%
  mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
  ungroup() %>%
  distinct(Sample.all, .keep_all = TRUE) %>%                     # Remove duplicates in Sample.all
  select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
#   Sample1 Sample2 Section
#     <int>   <int>   <int>
# 1       1       2       1
# 2       1       3       1
# 3       2       3       1
# 4       4       5       2

Then we can filter table2 by table1_cross. table3 is the final output.

table3 <- table2 %>%
  semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_cross

table3
#   Sample1 Sample2 distance
# 1       1       2       10
# 2       1       3        1
# 3       2       3        5
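One caveat worth noting: semi_join() matches on the exact (Sample1, Sample2) order, so the filter relies on table2 listing each pair in the same order as table1_cross (smaller sample ID first, as in the example data). If that is not guaranteed, the pairs can be normalised beforehand. The following is a minimal base R sketch under that assumption; table2_unordered and table2_ordered are hypothetical names used only for illustration.

```r
# Hypothetical distance table whose pairs are not consistently ordered
table2_unordered <- data.frame(
  Sample1  = c(2, 1, 4, 3),
  Sample2  = c(1, 3, 1, 2),
  distance = c(10, 1, 2, 5)
)

# Put the smaller sample ID first in every pair
lo <- pmin(table2_unordered$Sample1, table2_unordered$Sample2)
hi <- pmax(table2_unordered$Sample1, table2_unordered$Sample2)

table2_ordered <- data.frame(
  Sample1  = lo,
  Sample2  = hi,
  distance = table2_unordered$distance
)
table2_ordered
```

After this step, every row satisfies Sample1 < Sample2, so it can be filtered against table1_cross as shown above.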

Data

table1 <- read.table(text = "Sample    Section
1         1
                     2         1
                     3         1
                     4         2
                     5         2
                     6         3",
                     header = TRUE, stringsAsFactors = FALSE)

table2 <- read.table(text = "Sample1    Sample2    distance
1         2         10
                     1         3         1
                     1         4         2
                     2         3         5
                     2         4         10
                     3         4         11",
                     header = TRUE, stringsAsFactors = FALSE)

Answer 2

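The OP has asked to find all distance comparisons in table2 between samples that share a section in table1.

This can be achieved by two different approaches:

1. Look up the respective section IDs of Sample1 and Sample2 in table1 and keep only those rows of table2 where the section IDs match.
2. Create all unique combinations of sample IDs for each section in table1 and find the appropriate entries in table2 (if any).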

Approach 1

Base R

tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]

  Sample2 Sample1 distance
1       2       1       10
2       3       1        1
3       3       2        5
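The same lookup can also be written without merge(), which keeps the original row and column order of table2. This is a minimal sketch rather than part of the original answer; it assumes Sample is unique in table1, and sec1, sec2 and result are illustrative names.

```r
# Sample data as in the question
table1 <- data.frame(Sample = c(1, 2, 3, 4, 5, 6),
                     Section = c(1, 1, 1, 2, 2, 3))
table2 <- data.frame(
  Sample1  = c(1, 1, 1, 2, 2, 3),
  Sample2  = c(2, 3, 4, 3, 4, 4),
  distance = c(10, 1, 2, 5, 10, 11)
)

# Look up the section of each sample via match() (assumes Sample is unique)
sec1 <- table1$Section[match(table2$Sample1, table1$Sample)]
sec2 <- table1$Section[match(table2$Sample2, table1$Sample)]

# Keep only rows whose two samples share a section;
# which() also drops rows where a lookup returned NA
result <- table2[which(sec1 == sec2), ]
result
```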

`dplyr`

library(dplyr)
table2 %>% 
  inner_join(table1, by = c(Sample1 = "Sample")) %>% 
  inner_join(table1, by = c(Sample2 = "Sample")) %>% 
  filter(Section.x == Section.y) %>% 
  select(-Section.x, -Section.y)

  Sample1 Sample2 distance
1       1       2       10
2       1       3        1
3       2       3        5

`data.table`

Using nested joins

library(data.table)
tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
table1[tmp, on = .(Sample == Sample2)][
  Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]

Using merge() and chained data.table expressions

tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
  Section.x == Section.y, -c("Section.x", "Section.y")]

   Sample2 Sample1 distance
1:       2       1       10
2:       3       1        1
3:       3       2        5

Approach 2

Base R

table1_cross <- do.call(rbind, lst <- lapply(
  split(table1, table1$Section), 
  function(x) as.data.frame(combinat::combn2(x$Sample))))
merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))

Here, the handy combn2(x) function is used, which generates all combinations of the elements of x taken two at a time, e.g.,

combinat::combn2(1:3)

     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3

The tedious part is applying combn2() to each Section group separately and then assembling the results into a single data.frame that can be merged.
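For readers who prefer to avoid the combinat dependency, the per-section pairing can be sketched with utils::combn() from base R. This is an illustrative variant, not the original answer's code; groups, pairs_by_section, table1_pairs and result are hypothetical names. Sections with fewer than two samples are dropped up front, because combn(x, 2) interprets a single positive number x as seq_len(x) rather than as a one-element set.

```r
# Sample data as in the question
table1 <- data.frame(Sample = c(1, 2, 3, 4, 5, 6),
                     Section = c(1, 1, 1, 2, 2, 3))
table2 <- data.frame(
  Sample1  = c(1, 1, 1, 2, 2, 3),
  Sample2  = c(2, 3, 4, 3, 4, 4),
  distance = c(10, 1, 2, 5, 10, 11)
)

# Split sample IDs by section and keep only sections that can form a pair
groups <- split(table1$Sample, table1$Section)
groups <- groups[lengths(groups) >= 2]

# All unique pairs within each section, smaller sample ID first
pairs_by_section <- lapply(groups, function(s) {
  m <- t(combn(sort(s), 2))
  data.frame(Sample1 = m[, 1], Sample2 = m[, 2])
})
table1_pairs <- do.call(rbind, pairs_by_section)

# Keep the table2 rows that correspond to a within-section pair
result <- merge(table2, table1_pairs, by = c("Sample1", "Sample2"))
result
```

With the example data, table1_pairs contains the four pairs 1-2, 1-3, 2-3 and 4-5, and the merge keeps the three of them that have an entry in table2.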

`dplyr`

This is a streamlined version of www's approach

full_join(table1, table1, by = "Section") %>%
  filter(Sample.x < Sample.y) %>% 
  semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))

Non-equi self join (data.table)

library(data.table)
setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
              on = .(Sample1, Sample2), nomatch = 0L]

   Sample1 Sample2 distance Section
1:       1       2       10       1
2:       1       3        1       1
3:       2       3        5       1

Here, a non-equi join is used to create the unique combinations of Sample for each Section. This is equivalent to using combn2():

setDT(table1)[table1, on = .(Section, Sample < Sample), allow = TRUE,
              .(Section, Sample1 = x.Sample, Sample2 = i.Sample)]

   Section Sample1 Sample2
1:       1      NA       1
2:       1       1       2
3:       1       1       3
4:       1       2       3
5:       2      NA       4
6:       2       4       5
7:       3      NA       6

The NA rows will be removed in the final join.

2 answers

Answer 1: 1 vote, accepted, 2017-12-29 12:31:33

Answer 2: 0 votes, 2017-12-30 20:02:22
