Identifying list elements contained in another list that are both elements of a data frame

I have two data frames, DF1 and DF2, each with two columns (a, b). Column a is a unique identifier; column b is a list column whose elements are vectors of label names. I would like to check whether each element of DF2$b is contained in an element of DF1$b and, if so, create a new column, DF2$c, that takes the corresponding identifier from DF1$a. The tricky part is that I only want the identifier of the smallest list in DF1$b that contains all of the labels. As some background, this data is from a phylogenetic tree. DF2 is a subsample of DF1, and all tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to those of DF1 (the node names are different), but I can identify each node from the tips that descend from it.

It would be easier if I explain with an example:

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is made up of values from df1$a (the unique identifier). In the example, df2$c would be (in order):

c(1486, 1484, 1483, 1485, 1479, 1482, 1481, 1480)

To take the first two as an example:

df2$b[[1]] is c("KC792204", "KF150733")

All of its labels are found in df1$b[[1]], df1$b[[2]], and df1$b[[3]], i.e. in 1486, 1485, and 1484. Since I am looking for the shortest list, the result is 1486: it is the shortest list that contains all of the searched labels. The next list in df2$b is c("KC792204", "KC792205", "KC791848"). The result is 1484, since only list 1484 in df1$b contains those three labels.

I have tried:

df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')

But this compares the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.
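
To make the difference concrete, here is a minimal sketch on toy data (the names `big` and `small` are illustrative stand-ins for df1$b and df2$b, not part of the original data). Applied to lists, `%in%` matches whole elements, whereas `all(x %in% y)` tests whether every label of one vector occurs in another.

```r
# Toy stand-ins for df1$b and df2$b (illustrative data only)
big   <- list(c("A", "B", "C"), c("A", "B", "C", "D"))
small <- list(c("A", "B"), c("A", "D"))

# %in% on two lists compares whole elements, so no subset is ever detected
whole <- small %in% big

# all(x %in% y) checks that every label of x occurs somewhere in y
contained     <- all(small[[1]] %in% big[[1]])  # "A" and "B" are both in big[[1]]
not_contained <- all(small[[2]] %in% big[[1]])  # "D" is missing from big[[1]]

print(whole)
print(c(contained, not_contained))
```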

Here is one option:

library(data.table) # for %chin%

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

df2$c <- df1$a[
  Rfast::colMaxs(
    outer(
      seq_along(df1$b),
      seq_along(df2$b),
      function(i, j) mapply(
        function(x, y) all(y %chin% x),
        df1$b[i],
        df2$b[j]
      )
    )/lengths(df1$b)
  )
]
df2$c
#> [1] 1486 1484 1483 1485 1479 1482 1481 1480
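
The division by lengths(df1$b) is what selects the smallest list: each TRUE in the containment matrix becomes 1/length and each FALSE becomes 0, so the largest value in a column belongs to the shortest list that contains the query. As used above, Rfast::colMaxs returns the row position of that maximum. A rough base R sketch of the same idea, with a made-up matrix `m` and made-up `sizes`:

```r
# Toy containment matrix (illustrative only):
# rows = candidate lists, columns = queries
m <- matrix(c(TRUE, TRUE,  FALSE,   # query 1 is contained in lists 1 and 2
              TRUE, FALSE, TRUE),   # query 2 is contained in lists 1 and 3
            nrow = 3)
sizes <- c(6, 3, 4)                 # assumed lengths of the three candidate lists

# Row i is divided by sizes[i]; FALSE stays 0, TRUE becomes 1/size
scores <- m / sizes

# Position of the column-wise maximum = shortest list containing the query
best <- apply(scores, 2, which.max)
print(best)
```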

If it's possible for a row to have no match, then the above should be modified:

m <- outer(
  seq_along(df1$b),
  seq_along(df2$b),
  function(i, j) mapply(
    function(x, y) all(y %chin% x),
    df1$b[i],
    df2$b[j]
  )
)
df2$c <- ifelse(colSums(m) == 0L, NA, df1$a[Rfast::colMaxs(m/lengths(df1$b))])
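
If installing Rfast is not an option, the same logic can be sketched in base R. The helper name `smallest_container` and the toy data below are assumptions made for illustration; the function returns NA when no list contains all of the labels.

```r
# Hypothetical helper: id of the shortest list in `sets` that contains every
# element of `labels`, or NA if no list does
smallest_container <- function(labels, ids, sets) {
  hit <- vapply(sets, function(s) all(labels %in% s), logical(1))
  if (!any(hit)) return(NA_real_)
  sizes <- lengths(sets)
  sizes[!hit] <- Inf          # rule out lists that do not contain the labels
  ids[which.min(sizes)]       # the first one wins if several tie
}

# Toy data (illustrative only)
ids     <- c(10, 20, 30)
sets    <- list(c("A", "B", "C"), c("A", "B"), c("C", "D"))
queries <- list(c("A", "B"), c("C"), c("A", "D"))

res <- vapply(queries, smallest_container, numeric(1), ids = ids, sets = sets)
print(res)
```

With the data from the question, the equivalent call would be `df2$c <- vapply(df2$b, smallest_container, numeric(1), ids = df1$a, sets = df1$b)`.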

Here is an approach using data.table and a helper function:

library(data.table)
setDT(df1)[, l:=sapply(b,length)]
f <- function(k) df1[sapply(df1$b,\(i) all(k %chin% i))][l==min(l),a]
setDT(df2)[, c:=sapply(b,f)]

Output (df2):

       a                                   b     c
   <num>                              <list> <num>
1:  8645                   KC792204,KF150733  1486
2:  1247          KC792204,KC792205,KC791848  1484
3:  5879                   KJ201900,KF535207  1483
4:  1548 KC792204,JX987740,KX148108,JX987724  1485
5:  2487 GU731662,GU731661,MT826960,GU731663  1479
6:  1245 KJ201900,KJ201899,AB817119,AB817100  1482
7:  1247          GU731661,KP319229,KY428876  1481
8:  3695                   GU731662,MT826960  1480

Explanation:

  • Line 1: loads the library
  • Line 2: adds a column to df1 that holds the length ( l ) of each vector in b
  • Line 3: defines a helper function ( f ) that receives a character vector ( k ), restricts df1 to the rows for which all elements of k are found in b , and, of these rows, returns the a value for which l is minimized
  • Line 4: applies f to each element of b in df2 , assigning the result to c
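
One caveat with the filter `l == min(l)`: if two rows tie for the smallest length, f returns more than one value and sapply then produces a list rather than a numeric vector; if no row matches, min() is called on an empty vector. The sketch below is a variant, not the answerer's code; the name `f_first` and the toy table `d1` are assumptions. It keeps only the first of any tied rows and returns NA when nothing matches.

```r
library(data.table)

# Toy data (illustrative only): rows 1 and 2 tie as the shortest lists holding "A"
d1 <- data.frame(a = c(1, 2, 3))
d1$b <- list(c("A", "B"), c("A", "C"), c("A", "B", "C"))
setDT(d1)[, l := lengths(b)]

# Variant of the helper: always returns exactly one value
f_first <- function(k) {
  hits <- d1[sapply(b, function(i) all(k %chin% i))]
  if (nrow(hits) == 0L) return(NA_real_)   # no list contains all labels
  hits[which.min(l), a]                    # which.min keeps the first tie
}

print(f_first("A"))
print(f_first(c("B", "C")))
print(f_first("Z"))
```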
