根据R中的相似名称计算两列之间的差异

Question

I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects).我有一个包含 400 列基线和后续分数（以及 10,000 名受试者）的大型数据框。 Each alphabet represents a score and I would like to calculate the difference between the follow-up and baseline for each score in a new column:每个字母代表一个分数，我想计算新列中每个分数的后续和基线之间的差异：

subid子项	a_score.baseline a_score.baseline	a_score.followup a_score.followup	b_score.baseline b_score.baseline	b_score.followup b_score.followup	c_score.baseline c_score.baseline	c_score.followup c_score.followup
1 1	100 100	150 150	5 5	2 2	80 80	70 70
2 2	120 120	142 142	10 10	9 9	79 79	42 42
3 3	111 111	146 146	60 60	49 49	89 89	46 46
4 4	152 152	148 148	4 4	4 4	69 69	48 48
5 5	110 110	123 123	20 20	18 18	60 60	23 23
6 6	112 112	120 120	5 5	3 3	12 12	20 20
7 7	111 111	145 145	6 6	4 4	11 11	45 45

I'd like to calculate the difference between followup and baseline for each score in a new column like this:我想计算新列中每个分数的后续和基线之间的差异，如下所示：

df$a_score_difference = df$a_score.followup - df$a_score.baseleine

Any ideas on how to do this efficiently?关于如何有效地做到这一点的任何想法？ I really appreciate your help.我真的很感谢你的帮助。

code to generate sample data:生成示例数据的代码：

subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)

df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)

Answer 1

base R碱基R

scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13  8 34
# $b_score.followup
# [1]  -3  -1 -11   0  -2  -2  -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37   8  34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1     1              100              150                5                2               80               70                 50
# 2     2              120              142               10                9               79               42                 22
# 3     3              111              146               60               49               89               46                 35
# 4     4              152              148                4                4               69               48                 -4
# 5     5              110              123               20               18               60               23                 13
# 6     6              112              120                5                3               12               20                  8
# 7     7              111              145                6                4               11               45                 34
#   b_score.difference c_score.difference
# 1                 -3                -10
# 2                 -1                -37
# 3                -11                -43
# 4                  0                -21
# 5                 -2                -37
# 6                 -2                  8
# 7                 -2                 34

There exists (in an unsupervised mode) the possibility that not all followup s will have comparable baseline s, which could cause a problem.存在（在无监督模式下）并非所有followup s 都具有可比较的baseline s 的可能性，这可能会导致问题。 You might include a test to validate the presence and order:您可能包括一个测试来验证存在和顺序：

all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE

dplyr dplyr

You might consider pivoting the data into a more long format.您可能会考虑将数据转换为更长的格式。 This can be done in base R as well, but looks a lot simpler when done here:这也可以在基础 R 中完成，但在这里完成时看起来要简单得多：

library(dplyr)
# library(tidyr) # pivot_*
df %>%
  tidyr::pivot_longer(
    -subid,
    names_pattern = "(.*)_score.(.*)", 
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline)
# # A tibble: 21 x 5
#    subid ltr   baseline followup difference
#    <int> <chr>    <dbl>    <dbl>      <dbl>
#  1     1 a          100      150         50
#  2     1 b            5        2         -3
#  3     1 c           80       70        -10
#  4     2 a          120      142         22
#  5     2 b           10        9         -1
#  6     2 c           79       42        -37
#  7     3 a          111      146         35
#  8     3 b           60       49        -11
#  9     3 c           89       46        -43
# 10     4 a          152      148         -4
# # ... with 11 more rows

Honestly, I tend to prefer a long format most of the time for many reasons.老实说，出于多种原因，我大部分时间都倾向于使用长格式。 If, however, you want to make it wide again, then但是，如果您想再次使其变宽，那么

df %>%
  tidyr::pivot_longer(
    -subid, names_pattern = "(.*)_score.(.*)", 
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline) %>%
  tidyr::pivot_wider(
    names_from = "ltr", 
    values_from = c("baseline", "followup", "difference"), 
    names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
#   subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
#   <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>              <dbl>              <dbl>              <dbl>
# 1     1              100                5               80              150                2               70                 50                 -3                -10
# 2     2              120               10               79              142                9               42                 22                 -1                -37
# 3     3              111               60               89              146               49               46                 35                -11                -43
# 4     4              152                4               69              148                4               48                 -4                  0                -21
# 5     5              110               20               60              123               18               23                 13                 -2                -37
# 6     6              112                5               12              120                3               20                  8                 -2                  8
# 7     7              111                6               11              145                4               45                 34                 -2                 34

dplyr #2 dplyr #2

This is a keep-it-wide (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working on it in a longer format.这是一个 keep-it-wide（无旋转），如果您不打算以更长的格式处理它，它将比上面的 pivot-mutate-pivot 更有效。

df %>%
  mutate(across(
    ends_with("score.followup"),
    ~ . - cur_data()[[sub("followup", "baseline", cur_column())]], 
    .names = "{sub('followup', 'difference', col)}")
  )
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1     1              100              150                5                2               80               70                 50                 -3                -10
# 2     2              120              142               10                9               79               42                 22                 -1                -37
# 3     3              111              146               60               49               89               46                 35                -11                -43
# 4     4              152              148                4                4               69               48                 -4                  0                -21
# 5     5              110              123               20               18               60               23                 13                 -2                -37
# 6     6              112              120                5                3               12               20                  8                 -2                  8
# 7     7              111              145                6                4               11               45                 34                 -2                 34

根据R中的相似名称计算两列之间的差异

问题描述

1 个解决方案

解决方案1
3 2022-05-11 19:40:11

base R碱基R

dplyr dplyr

dplyr #2 dplyr #2

根据R中的相似名称计算两列之间的差异

问题描述

1 个解决方案

解决方案1 3 2022-05-11 19:40:11

base R碱基R

dplyr dplyr

dplyr #2 dplyr #2

解决方案1
3 2022-05-11 19:40:11