Calculate number of intersecting values between one data frame column and all columns of other data frames

I have some tables that I've extracted from a database via RODBC. The first has a primary key field, __ID.

dfA  <- data.frame(
`__ID` = c("a1","a2","a3"), 
col=c(1,2,3), 
check.names = FALSE )

  __ID col
1   a1   1
2   a2   2
3   a3   3

The second has foreign key fields whose names start with _ID.

dfB  <- data.frame(
"_ID0" = c("z1", "z2", "z3"), 
"_ID1" = c("a1", "b1", "c1"), 
`_ID2` = c("a1", "a2", "c1"), 
`_ID3` = c("a1", "a2", "a3"), 
check.names = FALSE  )

  _ID0 _ID1 _ID2 _ID3
1   z1   a1   a1   a1
2   z2   b1   a2   a2
3   z3   c1   c1   a3

I would like to produce the following data frame, which records the names of the two tables above and lists every pairwise combination of the primary key field from the first table with the foreign key fields from the second. For each pair, a column called intersects shows the number of intersecting values.

matches  <- data.frame(
pk_table = "dfA", 
pk=c("__ID", "__ID","__ID","__ID"), 
fk_table= c("dfB", "dfB","dfB","dfB"), 
fk=c("_ID0", "_ID1", "_ID2", "_ID3"), 
intersects=c(0, 1,2,3), 
check.names = FALSE )

  pk_table   pk fk_table   fk intersects
1      dfA __ID      dfB _ID0          0
2      dfA __ID      dfB _ID1          1
3      dfA __ID      dfB _ID2          2
4      dfA __ID      dfB _ID3          3

Here's an example of how a single value in the intersects column could be calculated. The value 1 is returned because the __ID column has one value that is also found in _ID1.

length( intersect(dfA$`__ID`, dfB$`_ID1`) )

How can I create the above without loops? Ideally, I would like a solution that accepts the following inputs:

  • table name and column name of the primary key field
  • all additional data structures (dfB, dfC, etc.)

The function should then count all matches between the primary key field and all other columns of all other data structures provided. In total, my database has 700 columns in 15 tables. My primary key field is in one table, and I would like to count how many of the values in this column occur in each of the columns of all 15 tables (including the table in which the key itself is found). I cannot assume that the foreign key columns follow a particular naming convention, but the total amount of data in the database is less than 50MB, so I don't expect performance issues.

This should do the trick:

library(dplyr)
library(tidyr)

# keep strings as character vectors (already the default in R >= 4.0)
options(stringsAsFactors = FALSE)

dfA  <- data.frame(
  `__ID` = c("a1","a2","a3"), 
  col=c(1,2,3),
  check.names = FALSE )

dfB  <- data.frame(
  fk_table = c("dfB", "dfB","dfB"), #added a column with the table name
  `_ID0` = c("z1", "z2", "z3"), 
  `_ID1` = c("a1", "b1", "c1"), 
  `_ID2` = c("a1", "a2", "c1"), 
  `_ID3` = c("a1", "a2", "a3"), 
  check.names = FALSE  )

dfB%>%
  # first we gather the dataframe to long, tidy format
  gather(key = fk, value = value, `_ID0`:`_ID3`)%>%

  # then we do a left join. 
  # this introduces NA's for values (e.g. c1) that are not in dfA
  left_join(dfA, by = c("value" = "__ID"))%>%

  # Now we group by fk name (e.g. _ID0)
  group_by(fk_table, fk)%>%

  # And we count how often the result is not NA
  # an inner_join followed by counting the rows would be simpler
  # but then you don't get zero values as in the example
  summarise(intersects=sum(!is.na(col)))

This returns the following:

  fk_table    fk intersects
1      dfB  _ID0          0
2      dfB  _ID1          1
3      dfB  _ID2          2
4      dfB  _ID3          3

The only difference is that the end result lacks the pk and pk_table columns, but those should not be difficult to add.
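
For completeness, here is a minimal base R sketch of the function the question asks for, assuming the tables are passed as a named list of data frames. The function name count_key_intersects and its argument names are hypothetical, not from any package. It uses lapply and vapply instead of explicit for loops and applies the same length(intersect(...)) calculation shown in the question to every column of every table, so it counts distinct matching values. That can differ from the join-based count above when a foreign key column contains repeated values.

```r
# Hypothetical helper: for one key column, count how many of its distinct
# values also appear in each column of every supplied data frame.
# Assumes `tables` is a named list of data frames (each with at least one
# column) that includes the table holding the primary key.
count_key_intersects <- function(pk_table, pk, tables) {
  # Compare as character so numeric and factor columns are handled uniformly
  key_values <- as.character(unique(tables[[pk_table]][[pk]]))

  per_table <- lapply(names(tables), function(tbl_name) {
    tbl <- tables[[tbl_name]]
    data.frame(
      pk_table   = pk_table,
      pk         = pk,
      fk_table   = tbl_name,
      fk         = names(tbl),
      intersects = vapply(
        tbl,
        function(column) length(intersect(key_values, as.character(column))),
        integer(1),
        USE.NAMES = FALSE
      ),
      stringsAsFactors = FALSE
    )
  })

  # Stack the per-table results into one data frame
  do.call(rbind, per_table)
}

# Example data, as defined in the question
dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE)

dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE)

result <- count_key_intersects("dfA", "__ID", list(dfA = dfA, dfB = dfB))
print(result)
```

Because the key's own table is in the list, the result also contains rows for the columns of dfA, including the key column compared with itself. Filter those rows out afterwards if only the other tables are of interest.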
