
Creating new columns with combinations of string patterns in R

I have a data frame with a column containing a long string separated by _. I want to count patterns, and several possible combinations of them, within that string. In the use case below, I want to count occurrences of events A and B only, and nothing else.

If A and B occur together as A_B or B_A, either once or repeated n times (for example A_B_A or B_A_B_A), I want to count each such combination, including cases where a combination occurs several times in the same string.

Example data frame:

participant <- c("A", "B", "C")
trial <- c(1,1,2)
string_pattern <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")

df <- data.frame(participant, trial, string_pattern)

Expected output:

   participant  trial  string_pattern   A_B  B_A  A_B_A  B_A_B  B_A_B_A
1  A            1      A_B_A_C_A_B      2    1    1      0      0
2  B            1      B_A_B_A_C_D_A_B  2    2    1      1      1
3  C            2      A_B_C_A_B        2    0    0      0      0

My code:


library(dplyr)  # provides %>% and mutate()

revised_df <- df %>%
  dplyr::mutate(A_B = stringr::str_count(string_pattern, "A_B"),
                B_A = stringr::str_count(string_pattern, "B_A"),
                B_A_B = stringr::str_count(string_pattern, "B_A_B"))

My approach gets complicated as the number of combinations increases, so I am looking for a better solution.

You could write a function to solve this:你可以写一个函数来解决这个问题:

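# Requires R >= 4.1 for the \(x) lambda syntax.
# m(): for a run of events such as "BABA", return every contiguous
# sub-run of two or more events, re-joined with "_" ("B_A", "B_A_B", ...).
# Assumes s has at least two characters.
m <- function(s){
  a <- seq(nchar(s)-1)
  start <- rep(a, rev(a))                              # start index of each sub-run
  stop <- ave(start, start, FUN = \(x)seq_along(x)+x)  # matching end indices
  b <- substring(s, start, stop)
  gsub('(?<=\\B)|(?=\\B)', '_', b, perl = TRUE)        # put "_" back between events
}

# n(): count every A/B sub-run in each string of x.
n <- function(x){
  names(x) <- x
  # Replace interior stretches of other events with ':', drop "_", split into A/B runs
  a <- strsplit(gsub("_", '', gsub("_[^AB]+_", ':', x)), ':')
  b <- t(table(stack(lapply(a, \(y)unlist(sapply(y, m))))))
  data.frame(pattern=x, as.data.frame.matrix(b), row.names = NULL)
}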
  

n(string_pattern)
          pattern A_B A_B_A B_A B_A_B B_A_B_A
1     A_B_A_C_A_B   2     1   1     0       0
2 B_A_B_A_C_D_A_B   2     1   2     1       1
3       A_B_C_A_B   2     0   0     0       0
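
To make the enumeration step inside m easier to follow, here is a more verbose sketch of the same idea. It is not the code from the answer above: sub_runs and its min_len argument are hypothetical names used only for illustration, and it works on a vector of events rather than on a collapsed string.

```r
# Hypothetical helper: list every contiguous sub-run of at least
# min_len events, joined with "_".
sub_runs <- function(tokens, min_len = 2) {
  n <- length(tokens)
  if (n < min_len) return(character(0))
  out <- character(0)
  for (i in seq_len(n - min_len + 1)) {       # start of the sub-run
    for (j in (i + min_len - 1):n) {          # end of the sub-run
      out <- c(out, paste(tokens[i:j], collapse = "_"))
    }
  }
  out
}

runs <- sub_runs(c("B", "A", "B", "A"))
table(runs)
```

Tabulating the result for the run B_A_B_A gives B_A twice and A_B, A_B_A, B_A_B and B_A_B_A once each, which is the contribution of that run to row 2 of the output above.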

Try this, which counts, in each row's string, the pattern named by the current column:


library(dplyr)
library(stringr)  # str_count() comes from stringr

df |> 
  mutate(A_B = 0, B_A = 0, A_B_A = 0, B_A_B = 0, B_A_B_A = 0) |> 
  mutate(across(A_B:B_A_B_A, ~ str_count(string_pattern, cur_column())))
  participant trial  string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1           A     1     A_B_A_C_A_B   2   1     1     0       0
2           B     1 B_A_B_A_C_D_A_B   2   2     1     1       1
3           C     2       A_B_C_A_B   2   0     0     0       0
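
One thing to watch with str_count: it counts non-overlapping matches, so A_B_A is counted once in A_B_A_B_A even though it starts at two positions. If overlapping occurrences should count, a base R sketch along the following lines may help. count_patterns is a hypothetical helper name, and the sketch assumes the patterns contain no regex metacharacters.

```r
# Example data from the question
df <- data.frame(
  participant    = c("A", "B", "C"),
  trial          = c(1, 1, 2),
  string_pattern = c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")
)

# Hypothetical helper: one column of counts per pattern, overlaps included.
count_patterns <- function(strings, patterns) {
  per_pattern <- lapply(patterns, function(p) {
    # A zero-width lookahead matches at every position where p starts,
    # so overlapping occurrences are each counted.
    hits <- gregexpr(paste0("(?=", p, ")"), strings, perl = TRUE)
    vapply(hits, function(h) sum(h > 0), integer(1))  # -1 means no match
  })
  matrix(unlist(per_pattern), nrow = length(strings),
         dimnames = list(NULL, patterns))
}

patterns <- c("A_B", "B_A", "A_B_A", "B_A_B", "B_A_B_A")
counts <- count_patterns(df$string_pattern, patterns)
result <- cbind(df, as.data.frame(counts))
result
```

On the example data this gives the same counts as above, because none of the three strings contains overlapping occurrences of a pattern. Adding a new combination only means adding it to the patterns vector.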
