Conditional match and return values in R
I have two tables named A and B. B has 300k rows; A has just one row. For each row of B, I want to check whether any value in that row matches a value in A. If it does, return 1; if it does not, return 0. The final result should be a matrix C of 0s and 1s with the same number of rows as B. I used the MATCH function in Excel, but my data is too large. Can this be done in R?
A:
A01B A01C A01D A01E A01F A01G
B:
id1 a A01C NA NA NA
id2 b A01C A01D NA NA
id3 c B01C B03D NA NA
id4 d A01F A01F A01F NA
...
C:
A01B A01C A01D A01E A01F A01G
0 1 0 0 0 0
0 1 1 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
There are many ways to do this; here is one I can think of. (There is probably something slicker and more efficient, but I think this will be OK for 300k rows.)
First, convert your data into a reproducible example. Here A is a vector in R (read yours in as necessary and coerce it to a vector):
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
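If your A was read in as a one-row table rather than typed by hand, the coercion to a vector might look roughly like the sketch below. The name `A_table` and its column names are made up for illustration; only `unlist` and `as.character` are doing the work.

```r
# Hypothetical one-row data.frame standing in for the real table A
A_table <- data.frame(V1 = "A01B", V2 = "A01C", V3 = "A01D",
                      stringsAsFactors = FALSE)

# unlist() flattens the single row into a vector; use.names = FALSE drops
# the column names, and as.character() guards against factor columns
A_vec <- as.character(unlist(A_table, use.names = FALSE))
A_vec
```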
I'm using the data.table package here because I like its syntax. You will need to make your B a data.table, not just a data.frame:
library(data.table)
# I used dput(B) to get this command to create a reproducible example
B <- data.table(structure(list(col1 = c("id1", "id2", "id3", "id4"), col2 = c("a",
"b", "c", "d"), col3 = c("A01C", "A01C", "B01C", "A01F"), col4 = c(NA,
"A01D", "B03D", "A01F"), col5 = c(NA, NA, NA, "A01F"), col6 = c(NA_character_,
NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-4L)))
# col1 col2 col3 col4 col5 col6
# <char> <char> <char> <char> <char> <char>
# 1: id1 a A01C <NA> <NA> <NA>
# 2: id2 b A01C A01D <NA> <NA>
# 3: id3 c B01C B03D <NA> <NA>
# 4: id4 d A01F A01F A01F <NA>
Now to your problem. Answer first, then explanation.
Answer:
> col_names <- tail(names(B), -2)
> B[,
sapply(
A,
function (code) { pmin(1, rowSums(.SD == code, na.rm=T)) },
simplify=F, USE.NAMES=T
),
.SDcols=col_names
]
A01B A01C A01D A01E A01F A01G
<num> <num> <num> <num> <num> <num>
1: 0 1 0 0 0 0
2: 0 1 1 0 0 0
3: 0 0 0 0 0 0
4: 0 0 0 0 1 0
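To get a rough feel for how the same idea behaves at the size mentioned in the question, here is a base R sketch on synthetic data. The 300,000 x 4 matrix `B_codes`, the extra codes in `pool`, and the seed are all assumptions made for illustration; timings will vary by machine, so none is quoted here.

```r
set.seed(1)  # synthetic data only; the real B will differ
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
n <- 300000

# Pool of possible cell values: the codes in A, two codes not in A, and NA
pool <- c(A, "B01C", "B03D", NA)
B_codes <- matrix(sample(pool, n * 4, replace = TRUE), nrow = n)

# Same technique as above: count matches per row for each code, cap at 1
elapsed <- system.time(
  C_big <- sapply(A, function(code) {
    pmin(1, rowSums(B_codes == code, na.rm = TRUE))
  })
)

dim(C_big)           # one row per row of B_codes, one column per code in A
elapsed["elapsed"]   # wall-clock seconds on this machine
```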
Edit: I just realised it's much easier to read if you ditch the data frame and use a matrix of all but the first 2 columns of B. Your result will then also be a matrix rather than a data frame.
# B[, ..col_names] if using a data.table
# B[, col_names] if using a data.frame
sapply(A, function (code) { pmin(1, rowSums(B[, ..col_names] == code, na.rm=T)) })
A01B A01C A01D A01E A01F A01G
[1,] 0 1 0 0 0 0
[2,] 0 1 1 0 0 0
[3,] 0 0 0 0 0 0
[4,] 0 0 0 0 1 0
Explanation:
First suppose I only have one code, 'A01C', and am just trying to produce the A01C column.
Start by making a vector of the column names we want to check (everything except the first 2):
col_names <- tail(names(B), -2)
Then check whether any of these columns is A01C (the .SDcols=col_names argument just selects columns 3 to 6):
# this is TRUE if the column has A01C in it.
> B[, .SD == 'A01C', .SDcols=col_names]
col3 col4 col5 col6
[1,] TRUE NA NA NA
[2,] TRUE FALSE NA NA
[3,] FALSE FALSE NA NA
[4,] FALSE FALSE FALSE NA
But we want to combine these into one value per row. We can do this by adding up the TRUEs in each row, which gives the number of matches; rowSums will do this. I add na.rm=T to treat the NAs as 0. The .(A01C=rowSums(...)) syntax just says "make the output a column called A01C".
# Condense this to one value per row: the number of matches.
> B[, .(A01C=rowSums(.SD == 'A01C', na.rm=T)), .SDcols=col_names]
A01C
<num>
1: 1
2: 1
3: 0
4: 0
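To see why na.rm matters here, the following base R sketch uses a single hand-built row shaped like row 2 of B. The names `row2` and `hits` are just for illustration.

```r
# One row shaped like row 2 of B: two codes followed by two NAs
row2 <- matrix(c("A01C", "A01D", NA, NA), nrow = 1)

# Comparing NA with anything gives NA, so hits is TRUE FALSE NA NA
hits <- row2 == "A01C"

rowSums(hits)                 # NA: a single NA makes the whole sum NA
rowSums(hits, na.rm = TRUE)   # 1: the NAs are dropped before summing
```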
Great, so now we just have to loop over every code in A and do this for each:
> B[,
sapply(
A,
function (code) { rowSums(.SD == code, na.rm=T) },
simplify=F, USE.NAMES=T
),
.SDcols=col_names
]
A01B A01C A01D A01E A01F A01G
<num> <num> <num> <num> <num> <num>
1: 0 1 0 0 0 0
2: 0 1 1 0 0 0
3: 0 0 0 0 0 0
4: 0 0 0 0 3 0
Note, however, that this returns the number of matches (e.g. the A01F column has 3 rather than 1 in row 4, because there are 3 A01Fs in that row). You seem to want just a 1 or 0, so we can take the minimum of each count and 1 (or we could do a > 0 check and convert to numeric; it doesn't matter). To do this we change rowSums(...) to pmin(1, rowSums(...)) and get the desired result already posted above.
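The "> 0 check and convert to numeric" alternative mentioned above could be sketched in base R as follows. Here `B_codes` is a hand-typed matrix copy of columns 3 to 6 of the example B, and `C_alt` is a name chosen for this sketch.

```r
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")

# Code columns of the example B (columns 3 to 6) as a plain character matrix
B_codes <- matrix(c("A01C", NA,     NA,     NA,
                    "A01C", "A01D", NA,     NA,
                    "B01C", "B03D", NA,     NA,
                    "A01F", "A01F", "A01F", NA),
                  nrow = 4, byrow = TRUE)

# Count matches per row, test for "at least one", then turn TRUE/FALSE into 1/0
C_alt <- sapply(A, function(code) {
  as.integer(rowSums(B_codes == code, na.rm = TRUE) > 0)
})
C_alt
```

Row 4 gets a 1 in the A01F column even though the code appears three times, which is the same effect pmin(1, ...) has.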
You can combine the column values in B into one column using tidyr::unite and then expand them into 1/0 values using cSplit_e from splitstackshape:
library(magrittr)  # provides the %>% pipe
result <- B %>%
tidyr::unite(tmp, V3:V4, na.rm = TRUE) %>%
splitstackshape::cSplit_e('tmp', sep = '_', type = 'character', fill = 0)
result
# V1 V2 tmp tmp_A01C tmp_A01D tmp_A01F tmp_B01C tmp_B03D
#1 id1 a A01C 1 0 0 0 0
#2 id2 b A01C_A01D 1 1 0 0 0
#3 id3 c B01C_B03D 0 0 0 1 1
#4 id4 d A01F_A01F 0 0 1 0 0
If certain values in A are not present in B at all, we can use setdiff to create those columns in result:
# result's indicator columns carry a "tmp_" prefix, so compare prefixed names
result[setdiff(paste0("tmp_", unlist(A)), names(result))] <- 0
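The setdiff step can be tried in isolation with base R alone. In the sketch below, `res` is a small hand-made stand-in for `result` (not the actual cSplit_e output), and the "tmp_" prefix is assumed to match the column names shown in the output above.

```r
# Hand-made stand-in for `result`: only two of the indicator columns exist
res <- data.frame(tmp_A01C = c(1, 1, 0, 0),
                  tmp_A01F = c(0, 0, 0, 1))
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")

# Column names we expect, using the same "tmp_" prefix as the existing columns
wanted <- paste0("tmp_", A)

# setdiff() keeps the wanted names that `res` does not have yet
missing_cols <- setdiff(wanted, names(res))
res[missing_cols] <- 0   # add each missing column, filled with 0

# Reorder so the columns follow the order of A
res <- res[wanted]
res
```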
You can use %in% with apply:
C <- +t(apply(B, 1, "%in%", x=A))
colnames(C) <- A
C
# A01B A01C A01D A01E A01F A01G
#a 0 1 0 0 0 0
#b 0 1 1 0 0 0
#c 0 0 0 0 0 0
#d 0 0 0 0 1 0
Data:
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
B <- read.table(row.names=2, text="
id1 a A01C NA NA NA
id2 b A01C A01D NA NA
id3 c B01C B03D NA NA
id4 d A01F A01F A01F NA")[-1]
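For readers who find the one-liner dense, the same %in% idea can be unpacked into separate steps. This is only a sketch using the data above; the intermediate name `hits` and the explicit anonymous function are introduced here for readability.

```r
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
B <- read.table(row.names = 2, text = "
id1 a A01C NA NA NA
id2 b A01C A01D NA NA
id3 c B01C B03D NA NA
id4 d A01F A01F A01F NA")[-1]

# For each row of B, ask which entries of A occur anywhere in that row.
# apply() returns one column per row of B, so hits is 6 x 4 (codes x rows).
hits <- apply(B, 1, function(row) A %in% row)

# Transpose to rows-of-B x codes and turn TRUE/FALSE into 1/0
C <- t(hits) * 1L
colnames(C) <- A
C
```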