R: Extract Rows from One Data Frame, Based on Column Names Matching Values from Another Data Frame
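I would like to know how to extract the identifiers from one data frame (data frame B) for the rows whose column names match the values stored across several columns of another data frame (data frame A).
To be more specific, I have two data frames:
Data frame A contains combinations of birth defects. Each row is a different combination, and each column holds the number of one defect included in that combination.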
# Combinations data frame
combos <- data.frame("combo_no" = c(1:4),
                     "Defect_A" = c(1,1,1,1),
                     "Defect_B" = c(3,2,3,4),
                     "Defect_C" = c(4,4,NA,7),
                     "Defect_D" = c(5,5,NA,8),
                     "Defect_E" = c(6,6,NA,NA))
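Data frame B contains individual cases. The first column holds a unique identifier (CASE_ID). The remaining columns are named after the number of a specific birth defect, with "1" meaning the defect is present and "0" meaning it is absent.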
# Cases data set
set.seed(99)
CASE_ID = c(1001:1005)
case1 = sample(0:1, 10, replace=TRUE)
case2 = sample(0:1, 10, replace=TRUE)
case3 = sample(0:1, 10, replace=TRUE)
case4 = sample(0:1, 10, replace=TRUE)
case5 = sample(0:1, 10, replace=TRUE)
def<-data.frame(rbind(case1, case2, case3, case4, case5))
colnames(def)<- c(1:10)
cases<-cbind(CASE_ID,def)
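Desired output: I want a list of the CASE_IDs from data frame B that have one of the birth defect combinations from data frame A. I would also like to indicate which combination is present. Ideally, the output would look like this: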
# Desired Output
output <- data.frame("CASE_ID" = c(1002,1003),
                     "combo_no" = c(3,1))
Thank you for your help.
Here is a solution, written the long way and commented step by step:
### my randomly generated cases DF:
cases
      CASE_ID 1 2 3 4 5 6 7 8 9 10
case1    1001 1 0 1 1 1 1 1 0 0  0
case2    1002 1 1 0 1 1 1 0 0 0  0
case3    1003 0 0 1 1 1 0 0 1 0  0
case4    1004 1 0 0 1 0 0 1 1 1  1
case5    1005 1 0 1 1 0 1 0 0 1  0
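The values that `sample()` returns after `set.seed(99)` depend on the R version (the default sampling algorithm changed in R 3.6.0), so re-running the question's code may not reproduce the table printed above. A minimal sketch that rebuilds exactly this table by hand, so the rest of the answer can be followed on the same data:

```r
# Rebuild the `cases` table exactly as printed, without relying on the RNG.
def <- rbind(
  case1 = c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0),
  case2 = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0),
  case3 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0),
  case4 = c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1),
  case5 = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
)
# Column names are the defect numbers 1..10 (stored as character).
colnames(def) <- 1:10
# check.names = FALSE keeps "1".."10" instead of turning them into "X1".."X10".
cases <- data.frame(CASE_ID = 1001:1005, def, check.names = FALSE)
```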
### initialize vectors to store found results
found_combos <- vector(); found_patients <- vector()
### open loop on combos rows
for (i in seq_len(nrow(combos))) {
  ### open empty vector to fill with the numbers that compose the defect combination
  defect_numbers <- vector()
  ### open loop on the defect columns (everything except combo_no) and take the numbers
  for (col in colnames(combos)[-1]) {
    number <- combos[i, col]
    if (!is.na(number)) defect_numbers <- append(defect_numbers, number)
  }
  ### sort the vector to avoid mismatch based on order
  defect_numbers <- sort(defect_numbers)
  ### open loop on patients table
  for (pz in seq_len(nrow(cases))) {
    ### column positions equal to 1, minus 1 because the first column is CASE_ID
    pz_numbers <- sort(which(cases[pz, ] == 1) - 1)
    ### first condition: same length
    if (length(pz_numbers) == length(defect_numbers)) {
      ### second condition: exactly the same numbers
      if (all(pz_numbers == defect_numbers)) {
        ### append to found results vectors
        found_patients <- append(found_patients, cases[pz, 1])
        found_combos <- append(found_combos, i)
      }
    }
  }
}
output <- data.frame("CASE_ID" = found_patients,
                     "combo_no" = found_combos)
### result:
output
  CASE_ID combo_no
1    1002        2
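The same exact-match test can also be written with `setequal()`, which compares two vectors as sets and therefore needs neither the sorting nor the separate length check. This is only a sketch of an alternative, not the code above; the helper names `combo_sets`, `case_sets`, `pairs` and `hit` are made up for illustration, and the data are typed in from the tables shown earlier.

```r
combos <- data.frame(combo_no = 1:4,
                     Defect_A = c(1, 1, 1, 1),
                     Defect_B = c(3, 2, 3, 4),
                     Defect_C = c(4, 4, NA, 7),
                     Defect_D = c(5, 5, NA, 8),
                     Defect_E = c(6, 6, NA, NA))
def <- rbind(case1 = c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0),
             case2 = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0),
             case3 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0),
             case4 = c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1),
             case5 = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0))
colnames(def) <- 1:10
cases <- data.frame(CASE_ID = 1001:1005, def, check.names = FALSE)

# One vector of defect numbers per combination, with the NA padding dropped.
combo_sets <- lapply(seq_len(nrow(combos)), function(i) {
  x <- unlist(combos[i, -1], use.names = FALSE)
  x[!is.na(x)]
})

# One vector of present defect numbers per case (the column names are the numbers).
defect_cols <- setdiff(names(cases), "CASE_ID")
case_sets <- lapply(seq_len(nrow(cases)), function(r) {
  as.numeric(defect_cols[unlist(cases[r, defect_cols]) == 1])
})

# Every (case, combination) pair; cases vary fastest, as in the nested loops.
pairs <- expand.grid(case = seq_along(case_sets), combo = seq_along(combo_sets))
hit <- mapply(function(p, k) setequal(case_sets[[p]], combo_sets[[k]]),
              pairs$case, pairs$combo)

output <- data.frame(CASE_ID = cases$CASE_ID[pairs$case[hit]],
                     combo_no = combos$combo_no[pairs$combo[hit]])
output  # a single row: CASE_ID 1002, combo_no 2
```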
Edit, based on your comment:
Just change the condition from equality (==) to %in%:
### initialize vectors to store found results
found_combos <- vector(); found_patients <- vector()
for (i in seq_len(nrow(combos))) {
  ### open empty vector to fill with the numbers that compose the defect combination
  defect_numbers <- vector()
  ### open loop on the defect columns (everything except combo_no) and take the numbers
  for (col in colnames(combos)[-1]) {
    number <- combos[i, col]
    if (!is.na(number)) defect_numbers <- append(defect_numbers, number)
  }
  ### sort the vector to avoid mismatch based on order
  defect_numbers <- sort(defect_numbers)
  ### open loop on patients table
  for (pz in seq_len(nrow(cases))) {
    ### column positions equal to 1, minus 1 because the first column is CASE_ID
    pz_numbers <- sort(which(cases[pz, ] == 1) - 1)
    ### only condition: all defect_numbers are in the pz_numbers vector
    if (all(defect_numbers %in% pz_numbers)) {
      ### append to found results vectors
      found_patients <- append(found_patients, cases[pz, 1])
      found_combos <- append(found_combos, i)
    }
  }
}
output <- data.frame("CASE_ID" = found_patients,
                     "combo_no" = found_combos)
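For larger tables, the "case contains every defect of the combination" test can be sketched without nested loops by turning both tables into 0/1 matrices and multiplying them. This is an alternative under assumptions, not the answer's method: the names `case_mat`, `combo_mat`, `shared` and `hits` are made up for illustration, and the data are typed in from the tables shown earlier.

```r
combos <- data.frame(combo_no = 1:4,
                     Defect_A = c(1, 1, 1, 1),
                     Defect_B = c(3, 2, 3, 4),
                     Defect_C = c(4, 4, NA, 7),
                     Defect_D = c(5, 5, NA, 8),
                     Defect_E = c(6, 6, NA, NA))
def <- rbind(case1 = c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0),
             case2 = c(1, 1, 0, 1, 1, 1, 0, 0, 0, 0),
             case3 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0),
             case4 = c(1, 0, 0, 1, 0, 0, 1, 1, 1, 1),
             case5 = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0))
colnames(def) <- 1:10
cases <- data.frame(CASE_ID = 1001:1005, def, check.names = FALSE)

defect_cols <- setdiff(names(cases), "CASE_ID")
case_mat <- as.matrix(cases[, defect_cols])  # cases x defects, 0/1

# combo_mat[k, d] is 1 when combination k includes defect d, else 0.
combo_mat <- matrix(0, nrow = nrow(combos), ncol = length(defect_cols),
                    dimnames = list(NULL, defect_cols))
for (k in seq_len(nrow(combos))) {
  d <- unlist(combos[k, -1], use.names = FALSE)
  combo_mat[k, as.character(d[!is.na(d)])] <- 1
}

# shared[p, k] counts how many defects of combination k are present in case p.
shared <- case_mat %*% t(combo_mat)
# A case contains a combination when that count equals the combination's size.
combo_size <- matrix(rowSums(combo_mat), nrow = nrow(shared),
                     ncol = ncol(shared), byrow = TRUE)
# which() walks the matrix column by column, so hits are ordered by combination.
hits <- which(shared == combo_size, arr.ind = TRUE)

output <- data.frame(CASE_ID = cases$CASE_ID[hits[, "row"]],
                     combo_no = combos$combo_no[hits[, "col"]])
output
# CASE_ID 1001 1002 1001 1005 1004 paired with combo_no 1 2 3 3 4
```

A case can now appear more than once, because containing a larger combination (such as combination 1) implies containing any smaller one made of the same defects (such as combination 3).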