Calculating the difference between two columns based on similar names in R
I have a large data frame with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each letter prefix identifies a score, and for each score I would like to calculate the difference between follow-up and baseline in a new column:
subid | a_score.baseline | a_score.followup | b_score.baseline | b_score.followup | c_score.baseline | c_score.followup
---|---|---|---|---|---|---
1 | 100 | 150 | 5 | 2 | 80 | 70
2 | 120 | 142 | 10 | 9 | 79 | 42
3 | 111 | 146 | 60 | 49 | 89 | 46
4 | 152 | 148 | 4 | 4 | 69 | 48
5 | 110 | 123 | 20 | 18 | 60 | 23
6 | 112 | 120 | 5 | 3 | 12 | 20
7 | 111 | 145 | 6 | 4 | 11 | 45
I'd like to calculate the difference between followup and baseline for each score in a new column, like this:
df$a_score_difference = df$a_score.followup - df$a_score.baseline
Any ideas on how to do this efficiently? I really appreciate your help.
Code to generate sample data:
subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)
df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)
scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13 8 34
# $b_score.followup
# [1] -3 -1 -11 0 -2 -2 -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37 8 34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1 1 100 150 5 2 80 70 50
# 2 2 120 142 10 9 79 42 22
# 3 3 111 146 60 49 89 46 35
# 4 4 152 148 4 4 69 48 -4
# 5 5 110 123 20 18 60 23 13
# 6 6 112 120 5 3 12 20 8
# 7 7 111 145 6 4 11 45 34
# b_score.difference c_score.difference
# 1 -3 -10
# 2 -1 -37
# 3 -11 -43
# 4 0 -21
# 5 -2 -37
# 6 -2 8
# 7 -2 34
When this runs unsupervised, it is possible that not every `followup` column has a matching `baseline` column, which could cause a problem. You might include a test to validate both presence and order:
all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
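If the check above can fail on real data, one option is to pair columns by their shared stem and compute differences only for complete pairs. The following is a minimal base R sketch under that assumption. `add_score_differences()` is a hypothetical helper name, not something from the original answer or any package, and it assumes every score column ends in `_score.baseline` or `_score.followup`.

```r
# Sample data from the question
df <- data.frame(
  subid = 1:7,
  a_score.baseline = c(100, 120, 111, 152, 110, 112, 111),
  a_score.followup = c(150, 142, 146, 148, 123, 120, 145),
  b_score.baseline = c(5, 10, 60, 4, 20, 5, 6),
  b_score.followup = c(2, 9, 49, 4, 18, 3, 4),
  c_score.baseline = c(80, 79, 89, 69, 60, 12, 11),
  c_score.followup = c(70, 42, 46, 48, 23, 20, 45)
)

# Hypothetical helper: adds "<stem>.difference" for every stem that has
# both a ".baseline" and a ".followup" column, and warns about the rest.
add_score_differences <- function(data) {
  nms <- names(data)
  base_stems <- sub("\\.baseline$", "", grep("_score\\.baseline$", nms, value = TRUE))
  fup_stems  <- sub("\\.followup$", "", grep("_score\\.followup$", nms, value = TRUE))

  # Stems present on both sides, and stems missing one half of the pair
  stems     <- intersect(fup_stems, base_stems)
  unmatched <- setdiff(union(fup_stems, base_stems), stems)
  if (length(unmatched) > 0) {
    warning("No complete baseline/followup pair for: ",
            paste(unmatched, collapse = ", "))
  }
  if (length(stems) == 0) {
    return(data)
  }

  # Column-wise followup minus baseline, matched by name rather than position
  diffs <- Map(`-`,
               data[paste0(stems, ".followup")],
               data[paste0(stems, ".baseline")])
  data[paste0(stems, ".difference")] <- diffs
  data
}

res <- add_score_differences(df)
names(res)
```

Because the pairs are built from the names themselves, the result does not depend on the column order in `df`, and an unpaired column produces a warning instead of a silently misaligned subtraction.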
You might consider pivoting the data into a longer format. This can be done in base R as well, but it looks a lot simpler with tidyr:
library(dplyr)
# library(tidyr) # pivot_*
df %>%
tidyr::pivot_longer(
-subid,
names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline)
# # A tibble: 21 x 5
# subid ltr baseline followup difference
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 a 100 150 50
# 2 1 b 5 2 -3
# 3 1 c 80 70 -10
# 4 2 a 120 142 22
# 5 2 b 10 9 -1
# 6 2 c 79 42 -37
# 7 3 a 111 146 35
# 8 3 b 60 49 -11
# 9 3 c 89 46 -43
# 10 4 a 152 148 -4
# # ... with 11 more rows
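The answer does not show the base R route it mentions. As a rough sketch of what it could look like, assuming the same column naming as the sample data, `stats::reshape()` can stack the baseline and follow-up columns into a long layout. The variable names `baseline_cols`, `followup_cols` and `long` are illustrative only.

```r
# Sample data from the question
df <- data.frame(
  subid = 1:7,
  a_score.baseline = c(100, 120, 111, 152, 110, 112, 111),
  a_score.followup = c(150, 142, 146, 148, 123, 120, 145),
  b_score.baseline = c(5, 10, 60, 4, 20, 5, 6),
  b_score.followup = c(2, 9, 49, 4, 18, 3, 4),
  c_score.baseline = c(80, 79, 89, 69, 60, 12, 11),
  c_score.followup = c(70, 42, 46, 48, 23, 20, 45)
)

# Build the two column sets in matching order, derived from the names
baseline_cols <- grep("_score\\.baseline$", names(df), value = TRUE)
followup_cols <- sub("baseline$", "followup", baseline_cols)

# One row per subject and score letter, with baseline/followup side by side
long <- reshape(
  df,
  direction = "long",
  varying   = list(baseline_cols, followup_cols),
  v.names   = c("baseline", "followup"),
  timevar   = "ltr",
  times     = sub("_score\\.baseline$", "", baseline_cols),
  idvar     = "subid"
)
long$difference <- long$followup - long$baseline

# Order by subject, then letter, to mirror the tidyr output
long <- long[order(long$subid, long$ltr), ]
rownames(long) <- NULL
head(long, 3)
```

This assumes every baseline column has a follow-up counterpart; if that is not guaranteed, `reshape()` will stop on the missing column name.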
Honestly, I tend to prefer a long format most of the time, for many reasons. If, however, you want to make it wide again, then:
df %>%
tidyr::pivot_longer(
-subid, names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline) %>%
tidyr::pivot_wider(
names_from = "ltr",
values_from = c("baseline", "followup", "difference"),
names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
# subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 100 5 80 150 2 70 50 -3 -10
# 2 2 120 10 79 142 9 42 22 -1 -37
# 3 3 111 60 89 146 49 46 35 -11 -43
# 4 4 152 4 69 148 4 48 -4 0 -21
# 5 5 110 20 60 123 18 23 13 -2 -37
# 6 6 112 5 12 120 3 20 8 -2 8
# 7 7 111 6 11 145 4 45 34 -2 34
This is a keep-it-wide approach (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working with the data in a longer format.
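df %>%
  mutate(across(
    ends_with("score.followup"),
    # for each followup column, look up its baseline partner by name
    # (note: cur_data() was deprecated in dplyr 1.1.0 in favor of pick())
    ~ . - cur_data()[[sub("followup", "baseline", cur_column())]],
    .names = "{sub('followup', 'difference', col)}")
  )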
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1 1 100 150 5 2 80 70 50 -3 -10
# 2 2 120 142 10 9 79 42 22 -1 -37
# 3 3 111 146 60 49 89 46 35 -11 -43
# 4 4 152 148 4 4 69 48 -4 0 -21
# 5 5 110 123 20 18 60 23 13 -2 -37
# 6 6 112 120 5 3 12 20 8 -2 8
# 7 7 111 145 6 4 11 45 34 -2 34