Merge/combine rows with same ID and Date in R
I have an Excel database like the one below. The Excel form only had room to enter details for 3 drugs. Wherever there are more than 3 drugs, the extra ones were entered in another row with the same PID and Date.
Is there a way I can merge the rows in R so that each patient's records are in a single row? In the example below, I need to merge rows 1 & 2 and rows 4 & 6.
Thanks.
Row | PID | Date | Drug1 | Dose1 | Drug2 | Dose2 | Drug3 | Dose3 | Age | Place
---|---|---|---|---|---|---|---|---|---|---
1 | 11A | 25/10/2021 | RPG | 12 | NAT | 34 | QRT | 5 | 45 | PMk
2 | 11A | 25/10/2021 | BET | 10 | SET | 43 | BLT | 45 | |
3 | 12B | 20/10/2021 | ATY | 13 | LTP | 3 | CRT | 3 | 56 | GTL
4 | 13A | 22/10/2021 | GGS | 7 | GSF | 12 | ERE | 45 | 45 | RKS
5 | 13A | 26/10/2021 | BRT | 9 | ARR | 4 | GSF | 34 | 46 | GLO
6 | 13A | 22/10/2021 | DFS | 5 | | | | | |
7 | 14B | 04/08/2021 | GDS | 2 | TRE | 55 | HHS | 34 | 25 | MTK
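
For anyone who wants to run the answers, here is a minimal sketch of one way to recreate the table above in base R. The name `dat` matches the first answer (other answers call it `df` or `DT`). Reading the blank cells as `NA` is an assumption on my part; some of the output shown further down contains empty strings instead, so the original data may have been read differently.

```r
# Recreate the question's table; blank cells are read as NA (an assumption).
dat <- read.csv(text = "Row,PID,Date,Drug1,Dose1,Drug2,Dose2,Drug3,Dose3,Age,Place
1,11A,25/10/2021,RPG,12,NAT,34,QRT,5,45,PMk
2,11A,25/10/2021,BET,10,SET,43,BLT,45,,
3,12B,20/10/2021,ATY,13,LTP,3,CRT,3,56,GTL
4,13A,22/10/2021,GGS,7,GSF,12,ERE,45,45,RKS
5,13A,26/10/2021,BRT,9,ARR,4,GSF,34,46,GLO
6,13A,22/10/2021,DFS,5,,,,,,
7,14B,04/08/2021,GDS,2,TRE,55,HHS,34,25,MTK",
  na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Rows that repeat an earlier PID/Date combination (the "overflow" rows).
dup <- duplicated(dat[c("PID", "Date")])
dat$Row[dup]
```

The last line picks out rows 2 and 6, which are the rows the question asks to fold into rows 1 and 4.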
Up front: the two methods below are completely different approaches, not equivalent "base R vs dplyr" versions of the same thing. I'm sure either could be translated to the other.
The premise of the first is to reshape/pivot the data longer so that each Drug/Dose pair is on its own row, renumber the pairs appropriately, and then bring the data back to a wide layout.
NOTE: frankly, I usually prefer to deal with data in a long format, so consider keeping it in its state immediately before `pivot_wider`. This means you'd need to bring `Age` and `Place` back into it somehow.
Why? A long format deals very well with many types of aggregation; `ggplot2` really, really prefers data in the long format; and I dislike seeing and having to deal with all of the `NA`/empty values that will invariably occur in this wide format, since many PIDs don't have (e.g.) `Drug6` or later. This seems subjective, but it can really be an objective improvement to data-mangling, depending on your workflow.
library(dplyr)
# library(tidyr) # pivot_longer, pivot_wider
# Age and Place: keep the first non-missing, non-blank value per PID/Date
dat0 <- select(dat, PID, Date, Age, Place) %>%
  group_by(PID, Date) %>%
  summarize(across(everything(), ~ .[!is.na(.) & nzchar(trimws(.))][1] ))
dat %>%
  select(-Age, -Place) %>%
  tidyr::pivot_longer(
    -c(Row, PID, Date),
    names_to = c(".value", "iter"),
    names_pattern = "^([^0-9]+)([123]?)$") %>%
  arrange(Row, iter) %>%
  group_by(PID, Date) %>%
  mutate(iter = row_number()) %>%   # renumber the drugs 1..n within each PID/Date
  select(-Row) %>%
  tidyr::pivot_wider(
    id_cols = c("PID", "Date"), names_sep = "",   # id_cols named explicitly; recent tidyr no longer accepts it positionally
    names_from = "iter", values_from = c("Drug", "Dose")) %>%
  left_join(dat0, by = c("PID", "Date"))
# # A tibble: 5 x 16
# # Groups: PID, Date [5]
# PID Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11A 25/10/2021 RPG NAT QRT BET "SET" "BLT" 12 34 5 10 43 45 45 PMk
# 2 12B 20/10/2021 ATY LTP CRT <NA> <NA> <NA> 13 3 3 NA NA NA 56 GTL
# 3 13A 22/10/2021 GGS GSF ERE DFS "" "" 7 12 45 5 NA NA 45 RKS
# 4 13A 26/10/2021 BRT ARR GSF <NA> <NA> <NA> 9 4 34 NA NA NA 46 GLO
# 5 14B 04/08/2021 GDS TRE HHS <NA> <NA> <NA> 2 55 34 NA NA NA 25 MTK
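
To illustrate the note above about staying in long format, here is a minimal sketch that stops before `pivot_wider`. The `toy` data frame is a cut-down, partly made-up stand-in for the question's data (two drug slots instead of three, and one slot deliberately left `NA`), so treat the names and values as assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the question's data: two drug slots, one of them empty.
toy <- data.frame(
  PID   = c("11A", "11A", "12B"),
  Date  = c("25/10/2021", "25/10/2021", "20/10/2021"),
  Drug1 = c("RPG", "BET", "ATY"),
  Dose1 = c(12, 10, 13),
  Drug2 = c("NAT", NA, "LTP"),
  Dose2 = c(34, NA, 3)
)

long <- toy %>%
  pivot_longer(
    -c(PID, Date),
    names_to = c(".value", "iter"),          # "Drug1" -> column Drug, iter "1"
    names_pattern = "^([^0-9]+)([0-9]+)$"
  ) %>%
  filter(!is.na(Drug)) %>%                   # drop unused drug slots
  group_by(PID, Date) %>%
  mutate(iter = row_number()) %>%            # renumber 1..n within PID/Date
  ungroup()

# One row per PID/Date/drug, so aggregation needs no column juggling.
n_drugs <- count(long, PID, Date, name = "n_drugs")
n_drugs
```

In this shape there are no `Drug4`..`Drug6` columns full of `NA`; a patient with more drugs simply has more rows.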
Notes:
I broke out `dat0` early, since `Age` and `Place` don't really fit into the pivot/renumber/pivot mindset.
Here's a base R method that splits the data (according to your grouping criteria, `PID` and `Date`), finds the Drug/Dose columns that need to be renumbered, renames them, and then `merge`s all of the frames back together.
spl <- split(dat, ave(rep(1L, nrow(dat)), dat[,c("PID", "Date")], FUN = seq_along))
spl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 2 2 11A 25/10/2021 BET 10 SET 43 BLT 45 NA
# 6 6 13A 22/10/2021 DFS 5 NA NA NA
nms <- lapply(spl, function(x) grep("^(Drug|Dose)", colnames(x), value = TRUE))
nms <- data.frame(i = rep(names(nms), lengths(nms)), oldnm = unlist(nms))
nms$grp <- gsub("[0-9]+$", "", nms$oldnm)
nms$newnm <- paste0(nms$grp, ave(nms$grp, nms$grp, FUN = seq_along))
nms <- split(nms, nms$i)
newspl <- Map(function(x, nm) {
  colnames(x)[ match(nm$oldnm, colnames(x)) ] <- nm$newnm
  x
}, spl, nms)
newspl[-1] <- lapply(newspl[-1], function(x) {
  x[, c("PID", "Date", grep("^(Drug|Dose)", colnames(x), value = TRUE)), drop = FALSE ]
})
newspl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# PID Date Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 2 11A 25/10/2021 BET 10 SET 43 BLT 45
# 6 13A 22/10/2021 DFS 5 NA NA
Reduce(function(a, b) merge(a, b, by = c("PID", "Date"), all = TRUE), newspl)
# PID Date Row Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 1 11A 25/10/2021 1 RPG 12 NAT 34 QRT 5 45 PMk BET 10 SET 43 BLT 45
# 2 12B 20/10/2021 3 ATY 13 LTP 3 CRT 3 56 GTL <NA> NA <NA> NA <NA> NA
# 3 13A 22/10/2021 4 GGS 7 GSF 12 ERE 45 45 RKS DFS 5 NA NA
# 4 13A 26/10/2021 5 BRT 9 ARR 4 GSF 34 46 GLO <NA> NA <NA> NA <NA> NA
# 5 14B 04/08/2021 7 GDS 2 TRE 55 HHS 34 25 MTK <NA> NA <NA> NA <NA> NA
Notes:
The underlying premise is that you want to merge rows onto previous rows. This means (to me) using `base::merge` or `dplyr::full_join`; two good links for understanding these concepts, in case you are not aware of them: "How to join (merge) data frames (inner, outer, left, right)" and "What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?".
To do that, we need to determine which rows are duplicates of a previous row; further, we need to know how many previous same-key rows there are. There are a few ways to do this, but I think the easiest is with `base::split`. In this case, no PID/Date combination has more than two rows, but if one combination needed a third row, `spl` would be length 3 and the resulting names would go out to `Drug9`/`Dose9`.
The second portion (`nms <- ...`) is where we work on the names. The first few steps create a `nms` data frame that we'll use to map from old to new names. Since we want contiguous numbering through all multi-row groups, we aggregate on the base (number removed) of the Drug/Dose names, so that all `Drug` columns are numbered from `Drug1` up to however many there are.
Note: this assumes that there are always perfect pairs of `Drug#`/`Dose#`; if there is ever a mismatch, the numbering will be suspect.
We end with `nms` being a split data frame, just like `spl` of the data. This is useful and important, since we'll `Map` (a zip-like `lapply`) them together.
The third block updates `spl` with the new names. The result in `newspl` is just a renaming of the columns, so that when we merge the frames together no column duplication occurs.
One additional step here is removing unrelated columns from the second and subsequent frames in the list. That is, we keep `Age` and `Place` in the first frame but remove them from the rest. My assumption (based on the `NA`/empty nature of those fields in duplicate rows) is that we only want to keep the first row's values.
The last step is to iteratively `merge` them together. The `Reduce` function is nice for this.
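
Two of the building blocks described in these notes can be tried in isolation. The sketch below uses small made-up frames (shortened dates, and a hypothetical `Drug7` value `"XYZ"` that is not in the question's data) to show the within-key occurrence counter that drives `split`, and how `Reduce` chains `merge` over a list of any length.

```r
# 1. Occurrence counter: 1 for the first row of a PID/Date key, 2 for the next, ...
keys <- data.frame(
  PID  = c("11A", "11A", "12B", "13A", "13A", "13A"),
  Date = c("25/10", "25/10", "20/10", "22/10", "26/10", "22/10")
)
occ <- ave(rep(1L, nrow(keys)), keys[, c("PID", "Date")], FUN = seq_along)
occ
# split(keys, occ) would then put all "first" rows in one frame,
# all "second" rows in another, and so on.

# 2. Reduce() folds merge() over the list, two frames at a time.
first  <- data.frame(PID = c("11A", "12B"), Drug1 = c("RPG", "ATY"))
second <- data.frame(PID = "11A", Drug4 = "BET")
third  <- data.frame(PID = "11A", Drug7 = "XYZ")   # hypothetical third overflow row
merged <- Reduce(function(a, b) merge(a, b, by = "PID", all = TRUE),
                 list(first, second, third))
merged
```

Because `all = TRUE` makes each step a full outer join, keys that only appear in the first frame (here `12B`) are kept and get `NA` in the later columns.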
Another `tidyverse`-based solution, with a `pivot_longer` followed by a `pivot_wider`:
library(tidyverse)
# Note that my dataframe does not contain column Row
df %>%
  mutate(across(starts_with("Dose"), as.character)) %>%   # Drug and Dose must share a type to be stacked
  pivot_longer(!c(PID, Date, Age, Place), names_to = "trm") %>%
  group_by(PID, Date) %>%
  fill(Age, Place) %>%                                     # copy Age/Place down into overflow rows
  mutate(trm = paste(trm, 1:n(), sep = "_")) %>%           # make names unique within PID/Date
  ungroup %>%
  pivot_wider(id_cols = c(PID, Date, Age, Place), names_from = trm) %>%
  rename_with(~ paste0("Drug", 1:length(.x)), starts_with("Drug")) %>%
  rename_with(~ paste0("Dose", 1:length(.x)), starts_with("Dose")) %>%
  mutate(across(starts_with("Dose"), as.numeric))
#> # A tibble: 5 × 16
#> PID Date Age Place Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Drug4 Dose4 Drug5
#> <chr> <chr> <int> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
#> 1 11A 25/10… 45 PMk RPG 12 NAT 34 QRT 5 BET 10 SET
#> 2 12B 20/10… 56 GTL ATY 13 LTP 3 CRT 3 <NA> NA <NA>
#> 3 13A 22/10… 45 RKS GGS 7 GSF 12 ERE 45 DFS 5 <NA>
#> 4 13A 26/10… 46 GLO BRT 9 ARR 4 GSF 34 <NA> NA <NA>
#> 5 14B 04/08… 25 MTK GDS 2 TRE 55 HHS 34 <NA> NA <NA>
#> # … with 3 more variables: Dose5 <dbl>, Drug6 <chr>, Dose6 <dbl>
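
The `fill(Age, Place)` step in the pipeline above is what carries a patient's `Age` and `Place` into the overflow rows. A minimal sketch on made-up toy data follows; note that `fill()` only replaces `NA`, so if the blank cells were read as empty strings they would first need converting (for example with `dplyr::na_if`).

```r
library(dplyr)
library(tidyr)

# Toy data: the second 11A row is an overflow row with no Age/Place.
toy <- data.frame(
  PID   = c("11A", "11A", "12B"),
  Age   = c(45, NA, 56),
  Place = c("PMk", NA, "GTL")
)

filled <- toy %>%
  group_by(PID) %>%
  fill(Age, Place) %>%   # default direction is "down", applied within each group
  ungroup()
filled
```

Grouping first matters: without `group_by`, a missing value in the first row of one patient would be filled from the previous patient.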
Update:
With the help of akrun, see here: "Use ~separate after mutate and across".
We could:
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  group_by(PID) %>%
  summarise(across(everything(), ~toString(.))) %>%
  mutate(across(everything(), ~ list(tibble(col1 = .) %>%
    separate(col1, into = str_c(cur_column(), 1:3), sep = ",\\s+", fill = "left", extra = "drop")))) %>%
  unnest(c(PID, Row, Date, Drug1, Dose1, Drug2, Dose2, Drug3, Dose3, Age,
           Place)) %>%
  distinct() %>%
  select(-1, -2)
PID3 Row1 Row2 Row3 Date1 Date2 Date3 Drug11 Drug12 Drug13 Dose11 Dose12 Dose13 Drug21 Drug22 Drug23 Dose21 Dose22 Dose23 Drug31 Drug32 Drug33 Dose31 Dose32 Dose33 Age1 Age2 Age3 Place1 Place2 Place3
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 11A NA 1 2 NA 25/10/2021 25/10/2021 NA RPG BET NA 12 10 NA NAT SET NA 34 43 NA QRT BLT NA 5 45 NA 45 NA NA PMk NA
2 12B NA NA 3 NA NA 20/10/2021 NA NA ATY NA NA 13 NA NA LTP NA NA 3 NA NA CRT NA NA 3 NA NA 56 NA NA GTL
3 13A 4 5 6 22/10/2021 26/10/2021 22/10/2021 GGS BRT DFS 7 9 5 GSF ARR NA 12 4 NA ERE GSF NA 45 34 NA 45 46 NA RKS GLO NA
4 14B NA NA 7 NA NA 04/08/2021 NA NA GDS NA NA 2 NA NA TRE NA NA 55 NA NA HHS NA NA 34 NA NA 25 NA NA MTK
First answer: Keeping the excellent explanation of @r2evans in mind, we could do it this way if really desired.
library(dplyr)
df %>%
  group_by(PID) %>%
  summarise(across(everything(), ~toString(.)))
output:
PID Row Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 11A 1, 2 25/10/2021, 25/10/2021 RPG, BET 12, 10 NAT, SET 34, 43 QRT, BLT 5, 45 45, NA PMk, NA
2 12B 3 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
3 13A 4, 5, 6 22/10/2021, 26/10/2021, 22/10/2021 GGS, BRT, DFS 7, 9, 5 GSF, ARR, NA 12, 4, NA ERE, GSF, NA 45, 34, NA 45, 46, NA RKS, GLO, NA
4 14B 7 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
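
As a small illustration of what the `toString` collapse does to each column, here is a sketch on made-up toy data. It shows two side effects worth knowing about: numeric columns come back as character, and a missing value becomes the literal text "NA" inside the string.

```r
library(dplyr)

toy <- data.frame(
  PID   = c("11A", "11A", "12B"),
  Drug1 = c("RPG", "BET", "ATY"),
  Age   = c(45, NA, 56)
)

collapsed <- toy %>%
  group_by(PID) %>%
  summarise(across(everything(), toString))
collapsed

# The pieces can be recovered by splitting on the separator again.
strsplit(collapsed$Drug1, ",\\s*")
```

This is why the update above has to `separate` the strings again and why the resulting columns are all `<chr>`.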
A `data.table` approach:
library(data.table)
DT <- fread("Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
2 11A 25/10/2021 BET 10 SET 43 BLT 45
3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
6 13A 22/10/2021 DFS 5
7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK")
DT  # inspect the parsed table
# Melt to long format
ans <- melt(DT, id.vars = c("PID", "Date"),
measure.vars = patterns(drug = "^Drug", dose = "^Dose"),
na.rm = TRUE)
# Paste and Collapse, use ; as separator
ans <- ans[, lapply(.SD, paste0, collapse = ";"), by = .(PID, Date)]
# Split string on ;
ans[, paste0("Drug", 1:length(tstrsplit(ans$drug, ";"))) := tstrsplit(drug, ";")]
ans[, paste0("Dose", 1:length(tstrsplit(ans$dose, ";"))) := tstrsplit(dose, ";")]
#join Age + Place data
ans[DT[!is.na(Age), ], `:=`(Age = i.Age, Place = i.Place), on = .(PID, Date)]
ans[, -c("variable", "drug", "dose")]
# PID Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# 1: 11A 25/10/2021 RPG BET NAT SET QRT BLT 12 10 34 43 5 45 45 PMk
# 2: 12B 20/10/2021 ATY LTP CRT <NA> <NA> <NA> 13 3 3 <NA> <NA> <NA> 56 GTL
# 3: 13A 22/10/2021 GGS DFS GSF ERE <NA> <NA> 7 5 12 45 <NA> <NA> 45 RKS
# 4: 13A 26/10/2021 BRT ARR GSF <NA> <NA> <NA> 9 4 34 <NA> <NA> <NA> 46 GLO
# 5: 14B 04/08/2021 GDS TRE HHS <NA> <NA> <NA> 2 55 34 <NA> <NA> <NA> 25 MTK
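
The paste-then-split step above relies on `tstrsplit`, which splits each string and transposes the result so that there is one list element per position. A minimal sketch on made-up strings (the `toy` table and its values are for illustration only):

```r
library(data.table)

x <- c("RPG;BET;NAT", "ATY;LTP")

# One list element per position; shorter inputs are padded with NA.
parts <- tstrsplit(x, ";")
parts

# Assigning the pieces to new columns by reference, as in the answer above.
toy <- data.table(PID = c("11A", "12B"), drug = x)
toy[, paste0("Drug", seq_along(parts)) := tstrsplit(drug, ";")]
toy
```

The number of new columns is driven by the longest string, which is how the answer arrives at `Drug1`..`Drug6` without hard-coding the count.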
Another answer to the festival.
Reading the data from this page:
require(rvest)
require(tidyverse)
d = read_html("https://stackoverflow.com/q/69787018/694915") %>%
html_nodes("table") %>%
html_table(fill = TRUE)
List of doses per PID and Date:
# first table on the page
d[[1]] -> df
df %>%
  pivot_longer(
    cols = starts_with("Drug"),
    values_to = "Drug"
  ) %>%
  select( !name ) %>%
  pivot_longer(
    cols = starts_with("Dose"),
    values_to = "Dose"
  ) %>%
  select( !name ) %>%
  drop_na() %>%
  pivot_wider(
    names_from = Drug,
    values_from = Dose ,
    values_fill = list(0)
  ) -> dose
The variable `dose` contains this data:
( https://i.stack.imgur.com/lc3iN.png )
Not as elegant as the previous ones, but it is one way to see the whole treatment per PID.