R: Applying functions over a list of data frames

Question

I need help applying functions over a list of data frames.

I don't often work with lists or functions, so after 3 hours of searching and testing I need some assistance.

I have a list of 2 data frames as follows (the real list has 40+):

df1 <- structure(list(ID = 1:4, 
    Period = c("C_2021", "C_2021", "C_2021", "C_2021"), 
    subjects = c(2044L, 2044L, 2058L, 2059L), 
    Q_1_A = c(1L, 1L, 4L, 6L), 
    Q_1_B = c(6L, 1L, 6L, NA), 
    col3 = c(4L, 6L, 5L, 2L), 
    col4 = c(3L, 5L, 4L, 4L)), 
    class = "data.frame", row.names = c(NA, -4L))
        
df2 <- structure(list(ID = 1:4, 
    Period = c("C_2022", "C_2022", "C_2022", "C_2022"), 
    subjects = c(2058L, 2058L, 2065L, 2066L), 
    Q_1_A = c(2L, 5L, 5L, 6L), 
    Q_1_B = c(6L, 1L, 4L, NA), 
    col3 = c(NA, 6L, 5L, 3L), 
    col4 = c(3L, 6L, 5L, 5L)), 
    class = "data.frame", row.names = c(NA, -4L))

The structure of the datasets is as follows:

    df1
      ID Period subjects Q_1_A Q_1_B col3 col4
    1  1 C_2021     2044     1     6    4    3
    2  2 C_2021     2044     1     1    6    5
    3  3 C_2021     2058     4     6    5    4
    4  4 C_2021     2059     6    NA    2    4
    
    df2
      ID Period subjects Q_1_A Q_1_B col3 col4
    1  1 C_2022     2058     2     6   NA    3
    2  2 C_2022     2058     5     1    6    6
    3  3 C_2022     2065     5     4    5    5
    4  4 C_2022     2066     6    NA    3    5

The list of data frames:

dflist <- list(df1, df2)

I would like to do 2 things:

1. Conditional removal of the string before the 2nd underscore

I would like to remove everything up to and including the 2nd underscore, but only in columns whose names begin with "Q". Column "Q_1_A" would become "A". The code should only affect columns starting with "Q".

Note: The ifelse is important. In the real data there are other columns with 2 underscores that must not be modified, and the columns in the data frames may be in different orders, so it needs to be done by column name.

# doesn't work (can't seem to get purrr working either)
    dflist <- lapply(dflist, function(x) {
      names(x) <- ifelse(starts_with(names(x), "Q"), sub("^[^_]*_", "", names(x)), .x)
      x})

2. Once column names are updated, keep only the columns named in a list and drop the rest.
Note: In the real data there are a lot of columns in each df, so it's much easier to list the columns to keep than the ones to remove.

The list of columns to keep is below. It assumes the renaming above has already been done.

col_keep <- c("ID", "Period", "subjects", "A", "B")

# doesn't work
dflist <- lapply(dflist, function(x) {
  x[(names(x) %in% col_keep)]
  x})

**UPDATE** I think the following will actually work:
dflist <- lapply(dflist, function(x) 
{x <- x %>% select(any_of(col_keep))})
# Is this the best way to do it?

Help would be greatly appreciated.

Answer 1

For the first requirement, apply this:

dflist <- lapply(dflist, function(x) {
    names(x) <- ifelse(startsWith(names(x), "Q"), 
    gsub("[Q_0-9]+", "" , names(x)), names(x))
    x})
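A note on why the attempt in the question fails: `starts_with()` is a tidyselect helper meant for use inside selecting functions such as `select()`, `.x` only exists inside purrr-style formula lambdas, and `sub("^[^_]*_", "", ...)` removes text only up to the first underscore (giving `1_A`). Separately, the pattern `[Q_0-9]+` above deletes every `Q`, underscore and digit anywhere in the name, which is fine for `Q_1_A` but would empty out a name such as `Q_2_Q10`. The following is a minimal sketch of a more conservative variant that strips only the prefix up to the second underscore. The helper name `strip_q_prefix` and the sample names are made up for illustration.

```r
# Hypothetical helper: rename only names starting with "Q",
# dropping everything up to and including the 2nd underscore.
strip_q_prefix <- function(nms) {
  is_q <- startsWith(nms, "Q")
  nms[is_q] <- sub("^[^_]*_[^_]*_", "", nms[is_q])
  nms
}

# Illustrative names; "X_1_keep" has two underscores but must stay untouched.
new_names <- strip_q_prefix(c("ID", "Q_1_A", "Q_1_B", "X_1_keep", "Q_2_Q10"))
print(new_names)
```

Inside `lapply()` this could be used as `names(x) <- strip_q_prefix(names(x))`.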

and for the second:

col_keep <- c("ID", "Period", "subjects", "A", "B")
dflist <- lapply(dflist, function(x) subset(x , select = col_keep))
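For comparison, the second attempt in the question fails only because the subset is computed and then discarded: the function's last expression is `x`, so the unmodified data frame is returned. A small sketch, using made-up toy data frames rather than the real data, shows that returning the subset directly is enough:

```r
col_keep <- c("ID", "Period", "subjects", "A", "B")

# Toy data frames (illustrative only), with columns in different orders.
dflist_demo <- list(
  data.frame(ID = 1:2, A = c(1L, 4L), col3 = c(4L, 6L)),
  data.frame(col4 = c(3L, 6L), B = c(6L, 1L), ID = 1:2)
)

# The subset is the function's return value, so lapply() keeps it.
kept <- lapply(dflist_demo, function(x) x[names(x) %in% col_keep])
kept_names <- lapply(kept, names)
print(kept_names)
```

With a logical index like this, the kept columns stay in each data frame's original order, and columns missing from a data frame are skipped silently.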

Answer 2

In base R:

lapply(dflist, \(x)setNames(x, sub('^Q([^_]*_){2}', '', names(x)))[col_keep])
[[1]]
  ID Period subjects A  B
1  1 C_2021     2044 1  6
2  2 C_2021     2044 1  1
3  3 C_2021     2058 4  6
4  4 C_2021     2059 6 NA

[[2]]
  ID Period subjects A  B
1  1 C_2022     2058 2  6
2  2 C_2022     2058 5  1
3  3 C_2022     2065 5  4
4  4 C_2022     2066 6 NA

In tidyverse:

library(tidyverse)
dflist %>%
  map(~rename_with(.,~str_remove(.,'([^_]+_){2}'), starts_with('Q'))%>%
        select(all_of(col_keep)))

[[1]]
  ID Period subjects A  B
1  1 C_2021     2044 1  6
2  2 C_2021     2044 1  1
3  3 C_2021     2058 4  6
4  4 C_2021     2059 6 NA

[[2]]
  ID Period subjects A  B
1  1 C_2022     2058 2  6
2  2 C_2022     2058 5  1
3  3 C_2022     2065 5  4
4  4 C_2022     2066 6 NA
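The update in the question uses `any_of()` while the tidyverse code above uses `all_of()`. The two differ when a data frame lacks one of the requested columns: `any_of()` skips the missing names, `all_of()` raises an error. The sketch below assumes dplyr is installed and uses a made-up data frame that is missing `subjects` and `B`.

```r
library(dplyr)

col_keep <- c("ID", "Period", "subjects", "A", "B")

# Toy data frame (illustrative only): no "subjects" or "B" column.
df <- data.frame(ID = 1:2, Period = "C_2021", A = c(1L, 4L), col3 = c(4L, 6L))

# any_of() keeps whichever requested columns exist.
lenient <- select(df, any_of(col_keep))
print(names(lenient))

# all_of() fails when a requested column is absent; capture that as FALSE.
strict_ok <- tryCatch({
  select(df, all_of(col_keep))
  TRUE
}, error = function(e) FALSE)
print(strict_ok)
```

Which one fits depends on whether a missing column in one of the 40+ data frames should be tolerated or flagged.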

Answer 3

Another solution using base R:

# wrap up code for ease of reading
validate_names <- function(df) {

setNames(df, ifelse(grepl("^Q", names(df)), 
         gsub("[Q_0-9]", "", names(df)), names(df)))
}

# lapply to transform list, then subset with character vector
lapply(dflist, validate_names) |> 
lapply(`[`, col_keep)
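Since the question mentions that columns may appear in different orders, here is a small check, on made-up data frames, that this name-based approach gives every data frame the same layout. The helper name `rename_q` is chosen here for illustration and mirrors the renaming function above.

```r
col_keep <- c("ID", "Period", "subjects", "A", "B")

# Two toy data frames (illustrative only) with the same columns in different orders.
a <- data.frame(ID = 1:2, Period = "C_2021", subjects = c(2044L, 2058L),
                Q_1_A = c(1L, 4L), Q_1_B = c(6L, NA), col3 = c(4L, 5L))
b <- data.frame(col3 = c(NA, 6L), Q_1_B = c(6L, 1L), Q_1_A = c(2L, 5L),
                subjects = c(2058L, 2065L), Period = "C_2022", ID = 1:2)

# Rename by name, so column position does not matter.
rename_q <- function(df) {
  setNames(df, ifelse(grepl("^Q", names(df)),
                      gsub("[Q_0-9]", "", names(df)), names(df)))
}

# Rename, then subset with the character vector, which also reorders columns.
res <- lapply(lapply(list(a, b), rename_q), `[`, col_keep)

# TRUE for each data frame whose columns match col_keep exactly, in order.
same_layout <- vapply(res, function(d) identical(names(d), col_keep), logical(1))
print(same_layout)
```

Indexing with `col_keep` puts the columns in the order of `col_keep`, but it stops with an "undefined columns selected" error if any listed column is absent.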

3 answers

Answer 1  1  accepted  2022-06-21 23:43:05
Answer 2  1  2022-06-21 23:50:01
Answer 3  1  2022-06-22 00:01:28
