Count the number of intersecting values between one data frame column and all columns of other data frames

Question

I have some tables that I've extracted from a database via RODBC. The first has a primary key field, __ID.

dfA  <- data.frame(
`__ID` = c("a1","a2","a3"), 
col=c(1,2,3), 
check.names = FALSE )

  __ID col
1   a1   1
2   a2   2
3   a3   3

The second has foreign key fields whose names start with _ID.

dfB  <- data.frame(
"_ID0" = c("z1", "z2", "z3"), 
"_ID1" = c("a1", "b1", "c1"), 
`_ID2` = c("a1", "a2", "c1"), 
`_ID3` = c("a1", "a2", "a3"), 
check.names = FALSE  )

  _ID0 _ID1 _ID2 _ID3
1   z1   a1   a1   a1
2   z2   b1   a2   a2
3   z3   c1   c1   a3

I would like to produce the following data frame, which holds the names of the two tables above and all pairwise combinations of the primary key field from the first table with the foreign key fields from the other table. For each pair, a column called intersects shows the number of intersecting values.

matches  <- data.frame(
pk_table = "dfA", 
pk=c("__ID", "__ID","__ID","__ID"), 
fk_table= c("dfB", "dfB","dfB","dfB"), 
fk=c("_ID0", "_ID1", "_ID2", "_ID3"), 
intersects=c(0, 1,2,3), 
check.names = FALSE )

  pk_table   pk fk_table   fk intersects
1      dfA __ID      dfB _ID0          0
2      dfA __ID      dfB _ID1          1
3      dfA __ID      dfB _ID2          2
4      dfA __ID      dfB _ID3          3

Here's an example of how a single value for the intersects column could be calculated. The value 1 is returned because the __ID column has one value that is also found in _ID1.

length( intersect(dfA$`__ID`, dfB$`_ID1`) )

How can I create the above without loops? Ideally, I would like a solution that accepts the following inputs:

table name and column name of the primary key field
all additional data structures (dfB, dfC, etc.)

The function should then count all matches between the primary key field and all other columns of all other data structures provided. In total, my database has 700 columns in 15 tables. My primary key field is in one table, and I would like to count how many of the values in this column occur in each of the columns of all 15 tables (including the table in which it is found). I cannot assume that the foreign key columns follow a particular naming convention, but the total amount of data in the database is less than 50MB, so I don't expect performance issues.

Answer 1

This should do the trick:

library(dplyr)
library(tidyr)

# keep strings as character vectors (already the default in R >= 4.0)
options(stringsAsFactors = FALSE)

dfA  <- data.frame(
  `__ID` = c("a1","a2","a3"), 
  col=c(1,2,3),
  check.names = FALSE )

dfB  <- data.frame(
  fk_table = c("dfB", "dfB","dfB"), #added a column with the table name
  `_ID0` = c("z1", "z2", "z3"), 
  `_ID1` = c("a1", "b1", "c1"), 
  `_ID2` = c("a1", "a2", "c1"), 
  `_ID3` = c("a1", "a2", "a3"), 
  check.names = FALSE  )

dfB%>%
  # first we gather the dataframe to long, tidy format
  gather(key = fk, value = value, `_ID0`:`_ID3`)%>%

  # then we do a left join. 
  # this introduces NA's for values (e.g. c1) that are not in dfA
  left_join(dfA, by = c("value" = "__ID"))%>%

  # Now we group by fk name (e.g. _ID0)
  group_by(fk_table, fk)%>%

  # And we count how often the result is not NA
  # an inner_join followed by counting the rows would be simpler
  # but then you don't get zero values as in the example
  summarise(intersects=sum(!is.na(col)))

This returns the following:

  fk_table    fk intersects
1      dfB  _ID0          0
2      dfB  _ID1          1
3      dfB  _ID2          2
4      dfB  _ID3          3
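
As a side note, gather() is superseded in current tidyr, and the join-based count above tallies rows rather than distinct values, so it would diverge from length(intersect(...)) if a foreign key column repeated a value. The following is only a sketch of the same idea, assuming tidyr >= 1.0 and dplyr >= 1.0; the result name is made up for illustration, and the fk_table column is left out to keep it short.

```r
library(dplyr)
library(tidyr)

dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE, stringsAsFactors = FALSE)

dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE, stringsAsFactors = FALSE)

result <- dfB %>%
  # reshape to long format: one row per (column name, value) pair
  pivot_longer(everything(), names_to = "fk", values_to = "value") %>%
  # keep each value once per column so repeated values are not counted twice
  distinct(fk, value) %>%
  group_by(fk) %>%
  # count the distinct values that also occur in the primary key column
  summarise(intersects = sum(value %in% dfA[["__ID"]]), .groups = "drop")
```

Because the count uses %in% instead of testing col for NA, it also does not depend on dfA having a non-missing col column.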

The only difference is that the end result doesn't have the pk and pk_table columns, but I guess it won't be difficult to add them.
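
One possible way to add those columns and to cover several tables at once is sketched below in base R. The function name count_key_matches and its arguments are hypothetical (they are not part of the answer above), and the sketch assumes the tables are passed as a named list that includes the primary key table. lapply() and vapply() still iterate internally, so this only avoids explicit for loops.

```r
dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE, stringsAsFactors = FALSE)

dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: for every column of every supplied table, count how
# many distinct primary key values also occur in that column.
count_key_matches <- function(pk_table, pk, tables) {
  # `tables` is assumed to be a named list of data frames containing pk_table
  keys <- unique(as.character(tables[[pk_table]][[pk]]))
  per_table <- lapply(names(tables), function(tbl) {
    df <- tables[[tbl]]
    data.frame(
      pk_table = pk_table,
      pk = pk,
      fk_table = tbl,
      fk = names(df),
      # as.character() so factor and numeric columns are compared as text
      intersects = unname(vapply(
        df,
        function(x) length(intersect(keys, as.character(x))),
        integer(1))),
      row.names = NULL,
      stringsAsFactors = FALSE)
  })
  # stack the per-table results into one data frame
  do.call(rbind, per_table)
}

matches <- count_key_matches("dfA", "__ID", list(dfA = dfA, dfB = dfB))
```

Since the primary key table is part of the list, matches also contains rows for dfA's own columns; the dfB rows should reproduce the 0, 1, 2, 3 counts from the question. Comparing everything as character is a simplification that may not suit every column type.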

Answer 1: accepted 2016-09-30 12:29:47
