Calculating the difference between two columns based on similar names in R

I have a large data frame with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each letter represents a score, and I would like to calculate the difference between the follow-up and the baseline for each score in a new column:

subid  a_score.baseline  a_score.followup  b_score.baseline  b_score.followup  c_score.baseline  c_score.followup
    1               100               150                 5                 2                80                70
    2               120               142                10                 9                79                42
    3               111               146                60                49                89                46
    4               152               148                 4                 4                69                48
    5               110               123                20                18                60                23
    6               112               120                 5                 3                12                20
    7               111               145                 6                 4                11                45

I'd like to calculate the difference between followup and baseline for each score in a new column, like this:

df$a_score_difference = df$a_score.followup - df$a_score.baseline

Any ideas on how to do this efficiently? I really appreciate your help.

Code to generate sample data:

subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)

df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)

base R

scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13  8 34
# $b_score.followup
# [1]  -3  -1 -11   0  -2  -2  -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37   8  34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1     1              100              150                5                2               80               70                 50
# 2     2              120              142               10                9               79               42                 22
# 3     3              111              146               60               49               89               46                 35
# 4     4              152              148                4                4               69               48                 -4
# 5     5              110              123               20               18               60               23                 13
# 6     6              112              120                5                3               12               20                  8
# 7     7              111              145                6                4               11               45                 34
#   b_score.difference c_score.difference
# 1                 -3                -10
# 2                 -1                -37
# 3                -11                -43
# 4                  0                -21
# 5                 -2                -37
# 6                 -2                  8
# 7                 -2                 34

If this runs unsupervised, it is possible that not every followup column has a matching baseline column, in which case the positional subtraction above would pair the wrong columns or fail. You might include a test to validate both presence and order:

all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
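
If that check fails, one option is to pair the columns by their shared prefix rather than by position. The following is a minimal sketch and not part of the original answer: the helper name add_differences and its arguments are hypothetical, and it assumes every column of interest is named <prefix>.baseline or <prefix>.followup. Prefixes that lack one of the two columns are skipped with a warning.

```r
# Hypothetical helper: adds <prefix>.difference for every prefix that has
# both a .baseline and a .followup column; incomplete pairs are skipped.
add_differences <- function(data, before = "baseline", after = "followup") {
  nms <- names(data)
  pat_before <- paste0("\\.", before, "$")
  pat_after  <- paste0("\\.", after, "$")
  # Prefixes such as "a_score", collected from each side separately
  pre_before <- sub(pat_before, "", grep(pat_before, nms, value = TRUE))
  pre_after  <- sub(pat_after, "", grep(pat_after, nms, value = TRUE))
  both <- intersect(pre_after, pre_before)
  unmatched <- setdiff(union(pre_after, pre_before), both)
  if (length(unmatched) > 0) {
    warning("No complete pair for: ", paste(unmatched, collapse = ", "))
  }
  for (p in both) {
    data[[paste0(p, ".difference")]] <-
      data[[paste0(p, ".", after)]] - data[[paste0(p, ".", before)]]
  }
  data
}

# Toy data in which b_score has a follow-up but no baseline
toy <- data.frame(
  subid = 1:3,
  a_score.baseline = c(100, 120, 111),
  a_score.followup = c(150, 142, 146),
  b_score.followup = c(2, 9, 49)
)
res <- suppressWarnings(add_differences(toy))
res$a_score.difference
# [1] 50 22 35
```

Matching by name costs a little more than the positional Map() call, but it does not depend on the sort order of the column names.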

dplyr

You might consider pivoting the data into a longer format. This can be done in base R as well, but it looks a lot simpler when done here:

library(dplyr)
# library(tidyr) # pivot_*
df %>%
  tidyr::pivot_longer(
    -subid,
    names_pattern = "(.*)_score\\.(.*)",
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline)
# # A tibble: 21 x 5
#    subid ltr   baseline followup difference
#    <int> <chr>    <dbl>    <dbl>      <dbl>
#  1     1 a          100      150         50
#  2     1 b            5        2         -3
#  3     1 c           80       70        -10
#  4     2 a          120      142         22
#  5     2 b           10        9         -1
#  6     2 c           79       42        -37
#  7     3 a          111      146         35
#  8     3 b           60       49        -11
#  9     3 c           89       46        -43
# 10     4 a          152      148         -4
# # ... with 11 more rows
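
The base R route mentioned above is not shown in the answer. One possible version, offered only as a sketch, builds one block of rows per score letter and stacks the blocks. The object names ltrs and long are illustrative, and the code assumes every letter has both a baseline and a followup column.

```r
# Sample data from the question
df <- data.frame(
  subid = 1:7,
  a_score.baseline = c(100, 120, 111, 152, 110, 112, 111),
  a_score.followup = c(150, 142, 146, 148, 123, 120, 145),
  b_score.baseline = c(5, 10, 60, 4, 20, 5, 6),
  b_score.followup = c(2, 9, 49, 4, 18, 3, 4),
  c_score.baseline = c(80, 79, 89, 69, 60, 12, 11),
  c_score.followup = c(70, 42, 46, 48, 23, 20, 45)
)

# Score letters, taken from the baseline column names: "a", "b", "c"
ltrs <- sub("_score\\.baseline$", "",
            grep("_score\\.baseline$", names(df), value = TRUE))

# One block of rows per letter, stacked on top of each other
long <- do.call(rbind, lapply(ltrs, function(l) {
  data.frame(
    subid    = df$subid,
    ltr      = l,
    baseline = df[[paste0(l, "_score.baseline")]],
    followup = df[[paste0(l, "_score.followup")]]
  )
}))
long$difference <- long$followup - long$baseline

# Same row order as the pivot_longer() result: by subject, then letter
long <- long[order(long$subid, long$ltr), ]
rownames(long) <- NULL
head(long, 3)
#   subid ltr baseline followup difference
# 1     1   a      100      150         50
# 2     1   b        5        2         -3
# 3     1   c       80       70        -10
```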

Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then:

df %>%
  tidyr::pivot_longer(
    -subid, names_pattern = "(.*)_score\\.(.*)",
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline) %>%
  tidyr::pivot_wider(
    names_from = "ltr", 
    values_from = c("baseline", "followup", "difference"), 
    names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
#   subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
#   <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>              <dbl>              <dbl>              <dbl>
# 1     1              100                5               80              150                2               70                 50                 -3                -10
# 2     2              120               10               79              142                9               42                 22                 -1                -37
# 3     3              111               60               89              146               49               46                 35                -11                -43
# 4     4              152                4               69              148                4               48                 -4                  0                -21
# 5     5              110               20               60              123               18               23                 13                 -2                -37
# 6     6              112                5               12              120                3               20                  8                 -2                  8
# 7     7              111                6               11              145                4               45                 34                 -2                 34

dplyr #2

This is a keep-it-wide approach (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working with the data in a longer format.

df %>%
  mutate(across(
    ends_with("score.followup"),
    ~ . - cur_data()[[sub("followup", "baseline", cur_column())]], 
    .names = "{sub('followup', 'difference', .col)}")
  )
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1     1              100              150                5                2               80               70                 50                 -3                -10
# 2     2              120              142               10                9               79               42                 22                 -1                -37
# 3     3              111              146               60               49               89               46                 35                -11                -43
# 4     4              152              148                4                4               69               48                 -4                  0                -21
# 5     5              110              123               20               18               60               23                 13                 -2                -37
# 6     6              112              120                5                3               12               20                  8                 -2                  8
# 7     7              111              145                6                4               11               45                 34                 -2                 34
