
Loop on several variables with the same suffix in R

I have a database which looks like this, but with many more rows and columns.

Several variables (x, y, z) are measured at different times (1, 2, 3).

df <-
  tibble(
    x1 = rnorm(10),
    x2 = rnorm(10),
    x3 = rnorm(10),
    y1 = rnorm(10),
    y2 = rnorm(10),
    y3 = rnorm(10),
    z1 = rnorm(10),
    z2 = rnorm(10),
    z3 = rnorm(10),
  )

I am trying to create dummy variables from the variables with the same suffix (measured at the same time) like this:

df <- df %>% 
  mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>% 
  mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
  mutate(var3= ifelse(x3>0 & (y1<0.5 |z3<0.5),0,1)) 

I am used to coding in SAS or Stata, so I would like to use a function or a loop because I have many more variables in my database. But I think I don't have the right approach in R to deal with this.

Thank you very much for your help!

{dplyover} makes this kind of operation easy (disclaimer: I'm the maintainer), assuming that your desired output contains a typo (var3 uses y1 where y3 was presumably intended).

I think you want to use all variables with the same digit (1, 2, 3 and so on) in each calculation:

df <- df %>% 
  mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>% 
  mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
  mutate(var3= ifelse(x3>0 & (y3<0.5 |z3<0.5),0,1)) 

If that is the case, we can use dplyover::over to apply the same function over a vector. Here we construct the vector with cut_names("^[a-z]{1}"), which strips the leading letter from each variable name and leaves the unique trailing numbers, here c("1", "2", "3") (extract_names("[0-9]{1}$") should give the same vector). We can then construct the variable names using a special syntax: .("x{.x}"). On the first iteration .x evaluates to the first element of our vector, so this returns the object x1 (not a string!), which we can use inside the function argument of over.

library(dplyr)
library(dplyover) # Only on GitHub: https://github.com/TimTeaFan/dplyover

df %>% 
  mutate(over(cut_names("^[a-z]{1}"),
              ~ ifelse(.("x{.x}") > 0 & (.("y{.x}") < 0.5 | .("z{.x}") < 0.5), 0, 1),
              .names = "var{x}"
              ))

#> # A tibble: 10 x 12
#>        x1      x2      x3      y1     y2     y3     z1     z2       z3  var1
#>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl> <dbl>
#>  1  0.690  0.550   0.911   0.203  -0.111  0.530 -2.09   0.189  0.147       0
#>  2 -0.238  1.32   -0.145   0.744   1.05  -0.448  2.05  -1.04   1.50        1
#>  3  0.888  0.898  -1.46   -1.87   -1.14   1.59   1.91  -0.155  1.46        0
#>  4 -2.78  -1.34   -0.486  -0.0674  0.246  0.141  0.154  1.08  -0.319       1
#>  5 -1.20   0.835   1.28   -1.32   -0.674  0.115  0.362  1.06   0.515       1
#>  6  0.622 -0.713   0.0525  1.79   -0.427  0.819 -1.53  -0.885  0.00237     0
#>  7 -2.54   0.0197  0.942   0.230  -1.37  -1.02  -1.55  -0.721 -1.06        1
#>  8 -0.434  1.97   -0.274   0.848  -0.482 -0.422  0.197  0.497 -0.600       1
#>  9 -0.316 -0.219   0.467  -1.97   -0.718 -0.442 -1.39  -0.877  1.52        1
#> 10 -1.03   0.226   2.04    0.432  -1.02  -0.535  0.954 -1.11   0.804       1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>

Alternatively, we can use dplyr::across together with cur_column(), get() and gsub() to alter the name of the column on the fly. To name the new variables correctly, we use gsub() in the .names argument of across and wrap it in curly braces {} so that the expression is evaluated.

library(dplyr)

df %>% 
  mutate(across(starts_with("x"),
                ~ {
                  cur_c <- dplyr::cur_column()
                  ifelse(.x > 0 & (get(gsub("x","y", cur_c)) < 0.5 | get(gsub("x","z", cur_c)) < 0.5), 0, 1)
                },
                .names = '{gsub("x", "var", .col)}'
                ))

#> # A tibble: 10 x 12
#>         x1      x2     x3     y1      y2     y3      z1      z2      z3  var1
#>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#>  1 -0.423  -1.42   -1.15  -1.54   1.92   -0.511 -0.739   0.501   0.451      1
#>  2 -0.358   0.164   0.971 -1.61   1.96   -0.675 -0.0188 -1.88    1.63       1
#>  3 -0.453  -0.758  -0.258 -0.449 -0.795  -0.362 -1.81   -0.780  -1.90       1
#>  4  0.855   0.335  -1.36   0.796 -0.674  -1.37  -1.42   -1.03   -0.560      0
#>  5  0.436  -0.0487 -0.639  0.352 -0.325  -0.893 -0.746   0.0548 -0.394      0
#>  6 -0.228  -0.240  -0.854 -0.197  0.884   0.118 -0.0713  1.09   -0.0289     1
#>  7 -0.949  -0.231   0.428  0.290 -0.803   2.15  -1.11   -0.202  -1.21       1
#>  8  1.88   -0.0980 -2.60  -1.86  -0.0258 -0.965 -1.52   -0.539   0.108      0
#>  9  0.221   1.58   -1.46  -0.806  0.749   0.506  1.09    0.523   1.86       0
#> 10  0.0238 -0.389  -0.474  0.512 -0.448   0.178  0.529   1.56   -1.12       1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>

Created on 2022-06-08 by the reprex package (v2.0.1)

You could restructure your data along the principles of tidy data (see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html ).

Here is a conversion to a long format using tidyverse:

library(tidyverse)

df <-
  df |>
  # one row per value, with the original column name in `name`
  pivot_longer(everything()) |>
  # split e.g. "x1" into var = "x" and time = "1"
  separate(name, c("var", "time"), sep = "(?=[0-9])") |>
  pivot_wider(id_cols = "time",
              names_from = "var",
              names_prefix = "var_",
              values_from = "value",
              values_fn = list) |>
  unnest(-time) |>
  mutate(new_var = ifelse(var_x > 0 & (var_y < 0.5 | var_z < 0.5), 0, 1))

df

You would probably want to keep the data in a long format, but if you want, you can use pivot_wider to get back to the format you started with. E.g.:

df |>
  pivot_wider(values_from = c(starts_with("var_"), "new_var"),
              names_from = "time",
              values_fn = list) |> 
  unnest(everything())
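
A possibly shorter route to the same long layout, sketched here under some assumptions: pivot_longer() accepts the special ".value" entry in names_to, which keeps x, y and z as separate columns while the numeric suffix moves into time. The id column, the seed and the small 5-row tibble called wide are made up for this illustration and are not part of the original question.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: seed only so the sketch is reproducible
wide <- tibble::tibble(
  x1 = rnorm(5), x2 = rnorm(5),
  y1 = rnorm(5), y2 = rnorm(5),
  z1 = rnorm(5), z2 = rnorm(5)
)

long <- wide |>
  mutate(id = row_number()) |>                     # hypothetical row identifier
  pivot_longer(-id,
               names_to = c(".value", "time"),     # letter -> column, digits -> time
               names_pattern = "([a-z]+)([0-9]+)") |>
  mutate(var = ifelse(x > 0 & (y < 0.5 | z < 0.5), 0, 1))

# Back to the wide layout, with names rebuilt as x1, x2, ..., var1, var2
result <- long |>
  pivot_wider(id_cols = id,
              names_from = time,
              values_from = c(x, y, z, var),
              names_sep = "")
```

The explicit id column is what lets pivot_wider() line the rows up again without values_fn = list and unnest().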

As you suggested, a solution using a loop is definitely possible.

# times as unique non-alphabetical parts of column names
times <- unique(gsub('[[:alpha:]]', '', names(df)))
for (time in times) {
  
  # column names for current time
  xyz <- paste0(c('x', 'y', 'z'), time)
  df[[paste0('var', time)]] <- 
    ifelse(df[[xyz[1]]]>0 & (df[[xyz[2]]]<.5 | df[[xyz[3]]]<.5), 0, 1)
}
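
Since the question also asks about a function, here is a minimal sketch of the same loop wrapped in one. The name add_dummies, its out_prefix argument and the seeded demo data frame are hypothetical; the sketch assumes every x column has matching y and z columns with the same suffix.

```r
# Hypothetical helper: same logic as the loop, packaged as a function
add_dummies <- function(data, out_prefix = "var") {
  # suffixes are taken from the x columns only (x1, x2, ... -> "1", "2", ...)
  times <- sub("^x", "", grep("^x[0-9]+$", names(data), value = TRUE))
  for (time in times) {
    x <- data[[paste0("x", time)]]
    y <- data[[paste0("y", time)]]
    z <- data[[paste0("z", time)]]
    data[[paste0(out_prefix, time)]] <- ifelse(x > 0 & (y < 0.5 | z < 0.5), 0, 1)
  }
  data
}

# Demo data with columns x1, x2, x3, y1, ..., z3 (assumption: seeded for reproducibility)
set.seed(42)
demo <- as.data.frame(matrix(rnorm(90), nrow = 10))
names(demo) <- paste0(rep(c("x", "y", "z"), each = 3), 1:3)

demo_out <- add_dummies(demo)
```

Comparing one generated column against the hand-written expression, e.g. demo_out$var3 versus ifelse(demo$x3 > 0 & (demo$y3 < 0.5 | demo$z3 < 0.5), 0, 1), is a cheap way to catch suffix mix-ups like the y1 in the question.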

Another way I can think of is transforming the data into a 3D array (observation × time × variable) so that you can actually do the computation for all time points at once.

times <- unique(gsub('[[:alpha:]]', '', names(df)))
df.arr <- sapply(c('x', 'y', 'z'), 
                 function(var) as.matrix(df[, paste0(var, times)]), 
                 simplify='array')
new.vars <- ifelse(df.arr[, , 1]>0 & (df.arr[, , 2]<0.5 | df.arr[, , 3]<0.5), 0, 1)
colnames(new.vars) <- paste0('var', times)
cbind(df, new.vars)

Here, sapply creates a matrix from the columns of measurements of each variable at the different times and stacks these matrices into a 3D array.

If you trust (or ensure) the correct ordering of columns in the data frame, then instead of using sapply you can create the array just by modifying the object's dimensions. I didn't do any benchmarking, but I guess this could be the most computationally efficient solution (if it should matter).

df.arr <- as.matrix(df)
dim(df.arr) <- c(dim(df.arr) / c(1, 3), 3)
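
To see what "correct ordering" means in practice, the following sketch compares the two constructions on a small seeded data frame. The object names (demo, arr_sapply, arr_dim, arr_bad) are made up for this check; it assumes the columns are grouped by variable (x1, x2, x3, y1, ...), as in the question.

```r
set.seed(7)  # assumption: seed only so the check is reproducible
demo <- as.data.frame(matrix(rnorm(90), nrow = 10))
names(demo) <- paste0(rep(c("x", "y", "z"), each = 3), 1:3)
times <- unique(gsub("[[:alpha:]]", "", names(demo)))

# Array built explicitly by name: observation x time x variable
arr_sapply <- sapply(c("x", "y", "z"),
                     function(var) as.matrix(demo[, paste0(var, times)]),
                     simplify = "array")

# Array built by reshaping only: relies on the column order
arr_dim <- as.matrix(demo)
dim(arr_dim) <- c(nrow(demo), length(times), 3)

same <- isTRUE(all.equal(unname(arr_sapply), arr_dim))

# Columns grouped by time instead (x1, y1, z1, x2, ...) break the shortcut
shuffled <- demo[, order(gsub("[[:alpha:]]", "", names(demo)))]
arr_bad <- as.matrix(shuffled)
dim(arr_bad) <- c(nrow(shuffled), length(times), 3)

same_after_shuffle <- isTRUE(all.equal(unname(arr_sapply), arr_bad))
```

Here same should be TRUE and same_after_shuffle FALSE, which is why the reshaping shortcut is only safe when the column order is known.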
