Calculate number of intersecting values between one data frame column and all columns of other data frames
I have some tables that I've extracted from a database via RODBC. The first has a primary key field, __ID.
dfA <- data.frame(
`__ID` = c("a1","a2","a3"),
col=c(1,2,3),
check.names = FALSE )
  __ID col
1   a1   1
2   a2   2
3   a3   3
The second has foreign key fields whose names start with _ID.
dfB <- data.frame(
`_ID0` = c("z1", "z2", "z3"),
`_ID1` = c("a1", "b1", "c1"),
`_ID2` = c("a1", "a2", "c1"),
`_ID3` = c("a1", "a2", "a3"),
check.names = FALSE )
  _ID0 _ID1 _ID2 _ID3
1   z1   a1   a1   a1
2   z2   b1   a2   a2
3   z3   c1   c1   a3
I would like to produce the following data frame, which holds the names of the two tables above and every pairwise combination of the primary key field from the first table with the foreign key fields from the other table. For each pair, a column called intersects shows the number of intersecting values.
matches <- data.frame(
pk_table = "dfA",
pk=c("__ID", "__ID","__ID","__ID"),
fk_table= c("dfB", "dfB","dfB","dfB"),
fk=c("_ID0", "_ID1", "_ID2", "_ID3"),
intersects=c(0, 1,2,3),
check.names = FALSE )
  pk_table   pk fk_table   fk intersects
1      dfA __ID      dfB _ID0          0
2      dfA __ID      dfB _ID1          1
3      dfA __ID      dfB _ID2          2
4      dfA __ID      dfB _ID3          3
Here's an example of how a single value for the intersects column could be calculated. The value 1 is returned because the __ID column has one value that's also found in _ID1.
length( intersect(dfA$`__ID`, dfB$`_ID1`) )
How can I create the above without loops? Ideally, I would like a solution whose inputs include all additional data structures (dfB, dfC, etc.).

The function should then count all matches between the primary key field and all other columns of all other data structures provided. In total, my database has 700 columns in 15 tables. My primary key field is in one table, and I would like to count how many of the values in this column occur in each of the columns of all 15 tables (including the table in which the key itself is found). I cannot assume that the foreign key columns follow a particular naming convention, but the total amount of data in the database is less than 50MB, so I don't expect performance issues.
This should do the trick:
library(dplyr)
library(tidyr)
options(stringsAsFactors = FALSE) # already the default since R 4.0.0
dfA <- data.frame(
`__ID` = c("a1","a2","a3"),
col=c(1,2,3),
check.names = FALSE )
dfB <- data.frame(
fk_table = c("dfB", "dfB","dfB"), #added a column with the table name
`_ID0` = c("z1", "z2", "z3"),
`_ID1` = c("a1", "b1", "c1"),
`_ID2` = c("a1", "a2", "c1"),
`_ID3` = c("a1", "a2", "a3"),
check.names = FALSE )
dfB%>%
# first we gather the dataframe to long, tidy format
gather(key = fk, value = value, `_ID0`:`_ID3`)%>%
# then we do a left join.
# this introduces NA's for values (e.g. c1) that are not in dfA
left_join(dfA, by = c("value" = "__ID"))%>%
# Now we group by fk name (e.g. _ID0)
group_by(fk_table, fk)%>%
# And we count how often the result is not NA
# an inner_join followed by counting the rows would be simpler
# but then you don't get zero values as in the example
summarise(intersects=sum(!is.na(col)))
This returns the following:
  fk_table   fk intersects
1      dfB _ID0          0
2      dfB _ID1          1
3      dfB _ID2          2
4      dfB _ID3          3
The only difference is that the end result lacks the pk and pk_table columns, but I guess these won't be difficult to add.
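For the general case in the question (many tables, no naming convention for foreign keys), a base R version may be easier to adapt. The following is only a minimal sketch: the function name count_key_intersects and its three arguments are hypothetical, and it assumes the tables are passed as a named list of data frames. It applies length(intersect()) from the question to every column via lapply and vapply, so it counts distinct shared values, whereas the join above counts matching rows (the two differ when a foreign key column repeats a value).

```r
# Hypothetical helper (not from any package): counts the distinct values that
# one key column shares with every column of every table in a named list.
count_key_intersects <- function(pk_table, pk, tables) {
  # pk_table: name of the list element that holds the primary key
  # pk:       name of the primary key column
  # tables:   named list of data frames, including the primary key table
  stopifnot(pk_table %in% names(tables), pk %in% names(tables[[pk_table]]))
  key_values <- as.character(tables[[pk_table]][[pk]])

  per_table <- lapply(names(tables), function(tbl_name) {
    tbl <- tables[[tbl_name]]
    data.frame(
      pk_table = pk_table,
      pk = pk,
      fk_table = tbl_name,
      fk = names(tbl),
      # as.character() lets factor and numeric columns be compared safely
      intersects = unname(vapply(
        tbl,
        function(column) length(intersect(key_values, as.character(column))),
        integer(1)
      )),
      stringsAsFactors = FALSE
    )
  })

  # Stack the per-table results into one data frame
  do.call(rbind, per_table)
}

dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE)
dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE)

matches <- count_key_intersects("dfA", "__ID", list(dfA = dfA, dfB = dfB))
print(matches)
```

Passing only list(dfA = dfA, dfB = dfB) is for illustration; with 15 tables the same call takes a longer list. The key column itself is included in the result, so it is compared with itself as well; filter that row out if it is not wanted.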