Conditional match and return values in R

I have two tables named A and B. B has 300k rows; A has just one row. For each row of B, I want to check whether any value in that row matches each value in A: return 1 if it does, 0 if it does not. The result should be a 0/1 matrix C with the same number of rows as B and one column per value in A. I used the MATCH function in Excel, but my data is too large. Can this be done in R?

A:

A01B A01C A01D A01E A01F A01G

B:

id1 a A01C  NA    NA    NA 
id2 b A01C A01D   NA    NA
id3 c B01C B03D   NA    NA
id4 d A01F A01F  A01F   NA
...

C:

A01B A01C A01D A01E A01F A01G
 0    1     0    0    0    0
 0    1     1    0    0    0
 0    0     0    0    0    0
 0    0     0    0    1    0

There are many ways to do this; here is one I can think of. (There is probably something super slick and efficient, but I think this will be OK for 300k rows.)

First, turn your data into a reproducible example.

Here A is a vector in R (read yours in as necessary and coerce it to a vector):

A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")

I'm using the data.table package here because I like its syntax. You will need to make your B a data.table, not just a data.frame.

library(data.table)

# I used dput(B) to get this command to create a reproducible example
B <- data.table(structure(list(col1 = c("id1", "id2", "id3", "id4"), col2 = c("a", 
"b", "c", "d"), col3 = c("A01C", "A01C", "B01C", "A01F"), col4 = c(NA, 
"A01D", "B03D", "A01F"), col5 = c(NA, NA, NA, "A01F"), col6 = c(NA_character_, 
NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA, 
-4L)))

#      col1   col2   col3   col4   col5   col6
#    <char> <char> <char> <char> <char> <char>
# 1:    id1      a   A01C   <NA>   <NA>   <NA>
# 2:    id2      b   A01C   A01D   <NA>   <NA>
# 3:    id3      c   B01C   B03D   <NA>   <NA>
# 4:    id4      d   A01F   A01F   A01F   <NA>

Now to your problem. Answer first, then explanation. Answer:

> col_names <- tail(names(B), -2)
> B[,
     sapply(
         A,
         function (code) { pmin(1, rowSums(.SD == code, na.rm=T)) },
         simplify=F, USE.NAMES=T
      ),
      .SDcols=col_names
    ]
    A01B  A01C  A01D  A01E  A01F  A01G
   <num> <num> <num> <num> <num> <num>
1:     0     1     0     0     0     0
2:     0     1     1     0     0     0
3:     0     0     0     0     0     0
4:     0     0     0     0     1     0

Edit: just realised it's way easier to read if you ditch the data frame and just use a matrix of all but the first 2 columns of B! Your result will then also be a matrix rather than a data frame.

# B[, ..col_names] if using a data.table
# B[, col_names] if using a data.frame
sapply(A, function (code) { pmin(1, rowSums(B[, ..col_names] == code, na.rm=T)) })
     A01B A01C A01D A01E A01F A01G
[1,]    0    1    0    0    0    0
[2,]    0    1    1    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    1    0
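
Since the question mentions 300k rows, here is a rough way to check that the matrix version scales, sketched in base R on synthetic data. The value pool, the seed and the choice of 4 code columns are assumptions made up for the test, not the real B, and timings will vary by machine.

```r
set.seed(1)                                    # arbitrary seed, for repeatability
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
pool <- c(A, "B01C", "B03D", NA)               # made-up pool of cell values
n <- 300000
codes <- matrix(sample(pool, n * 4, replace = TRUE), nrow = n)

# Same sapply/rowSums/pmin pattern as above, applied to a plain matrix
elapsed <- system.time(
  C <- sapply(A, function(code) pmin(1, rowSums(codes == code, na.rm = TRUE)))
)
dim(C)
elapsed
```

The result has one row per row of the input and one 0/1 column per code in A.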

Explanation: first, presume I only have one code, 'A01C', and am just trying to produce the A01C column.

First make a vector of the column names we want to check (everything except the first 2):

col_names <- tail(names(B), -2)

Then check whether any of these columns is A01C (the .SDcols=col_names just selects columns 3 to 6):

# this is TRUE if the column has A01C in it. 
> B[, .SD == 'A01C', .SDcols=col_names]
      col3  col4  col5 col6
[1,]  TRUE    NA    NA   NA
[2,]  TRUE FALSE    NA   NA
[3,] FALSE FALSE    NA   NA
[4,] FALSE FALSE FALSE   NA

But we want to combine these into one value per row. We can do this by adding up the TRUEs in each row, which returns the number of matches. rowSums will do this. I add na.rm=T to treat the NA as 0. The .(A01C=rowSums(...)) syntax just says "make the output a column called A01C".

# But we want to condense this to one value per row.
> B[, .(A01C=rowSums(.SD == 'A01C', na.rm=T)), .SDcols=col_names]
    A01C
   <num>
1:     1
2:     1
3:     0
4:     0
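
To see why na.rm matters in this step, here is a minimal base-R sketch. The toy matrix m is made up for illustration and is not the B from the question.

```r
# Toy character matrix with missing cells (hypothetical data, not the real B)
m <- matrix(c("A01C", NA,
              "B01C", "A01C"),
            nrow = 2, byrow = TRUE)

hits <- m == "A01C"   # comparing NA with anything yields NA, not FALSE
hits

rowSums(hits)         # without na.rm, the NA propagates into the row total

counts <- rowSums(hits, na.rm = TRUE)  # NAs are dropped, so they count as 0
counts
```

Without na.rm, any row containing an empty cell would come out as NA instead of a match count.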

Great, so now we just have to loop over every code in A and do this for each one.

> B[,
     sapply(
         A,
         function (code) { rowSums(.SD == code, na.rm=T) },
         simplify=F, USE.NAMES=T
      ),
      .SDcols=col_names
    ]
    A01B  A01C  A01D  A01E  A01F  A01G
   <num> <num> <num> <num> <num> <num>
1:     0     1     0     0     0     0
2:     0     1     1     0     0     0
3:     0     0     0     0     0     0
4:     0     0     0     0     3     0

Note, however, that this returns the number of matches (e.g. row 4 of the A01F column has 3 rather than 1, because that row contains three A01Fs). You seem to want just a 1 or 0, so we can take the minimum of each count and 1 (or do a > 0 check and convert to numeric; it doesn't matter). To do this, change rowSums(...) to pmin(1, rowSums(...)), which gives the desired result already posted above.
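
As a quick check that the two options give the same thing, here is a base-R sketch that works on a plain character matrix of the code columns. The name codes and the use of a matrix instead of a data.table are assumptions made for illustration.

```r
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
# Code columns of the example B as a plain character matrix
codes <- matrix(c("A01C", NA,     NA,     NA,
                  "A01C", "A01D", NA,     NA,
                  "B01C", "B03D", NA,     NA,
                  "A01F", "A01F", "A01F", NA),
                nrow = 4, byrow = TRUE)

# Variant 1: cap the match count at 1
C_pmin <- sapply(A, function(code) pmin(1, rowSums(codes == code, na.rm = TRUE)))

# Variant 2: test for at least one match, then convert TRUE/FALSE to 1/0
C_gt0 <- sapply(A, function(code) as.numeric(rowSums(codes == code, na.rm = TRUE) > 0))

all.equal(C_pmin, C_gt0)
```

Either form turns the 3 in row 4 of the A01F column into a 1.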

You can combine the column values in B into one column using tidyr::unite and then expand them into 1/0 values using cSplit_e from splitstackshape:

library(magrittr)  # provides the %>% pipe used below

result <- B %>%
 tidyr::unite(tmp, V3:V4, na.rm = TRUE) %>%
 splitstackshape::cSplit_e('tmp', sep = '_', type = 'character', fill = 0)

result

#   V1 V2       tmp tmp_A01C tmp_A01D tmp_A01F tmp_B01C tmp_B03D
#1 id1  a      A01C        1        0        0        0        0
#2 id2  b A01C_A01D        1        1        0        0        0
#3 id3  c B01C_B03D        0        0        0        1        1
#4 id4  d A01F_A01F        0        0        1        0        0

If some values in A are not present in B at all, we can use setdiff to create those columns in result.

result[setdiff(unlist(A), names(result))] <- 0
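
A minimal base-R sketch of the setdiff step follows. It assumes the indicator columns are named after the codes themselves; the cSplit_e output above uses a tmp_ prefix, so the names may need adjusting before this works as intended. The small result table here is hypothetical.

```r
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
# Hypothetical indicator table whose columns are named after the codes found in B
result <- data.frame(A01C = c(1, 1, 0, 0),
                     A01D = c(0, 1, 0, 0),
                     A01F = c(0, 0, 0, 1))

missing_codes <- setdiff(A, names(result))  # codes in A with no column yet
result[missing_codes] <- 0                  # add them as all-zero columns
result <- result[A]                         # reorder the columns to match A
result
```

The final reordering step is an addition for readability, so the columns appear in the same order as A.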

You can use %in% with apply:

C <- +t(apply(B, 1, "%in%", x=A))
colnames(C) <- A
C
#  A01B A01C A01D A01E A01F A01G
#a    0    1    0    0    0    0
#b    0    1    1    0    0    0
#c    0    0    0    0    0    0
#d    0    0    0    0    1    0
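
To make the one-liner easier to follow, here is what it does for a single row, as a small base-R sketch. The vector row2 is just the code cells of row b typed out by hand.

```r
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
row2 <- c("A01C", "A01D", NA, NA)   # the code cells of row b

# For each element of A: does it appear anywhere in this row?
in_row <- A %in% row2
in_row

# Unary plus turns the logical vector into integers
+in_row
```

apply repeats this for every row and returns the results as columns, which is why the call is wrapped in t(). Unlike ==, %in% returns FALSE rather than NA when a cell is missing, so no na.rm handling is needed.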

Data:

A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
B <- read.table(row.names=2, text="
id1 a A01C  NA    NA    NA 
id2 b A01C A01D   NA    NA
id3 c B01C B03D   NA    NA
id4 d A01F A01F  A01F   NA")[-1]
