I have a large dataset that contains data about patients. Some patients have multiple rows and I want to combine these rows, so that each patient has one row.
I have about 20 different variables. Some variables need to stay the same when combining rows (e.g., a patient with 4 rows who is in group 1 should still be in group 1 when the rows are combined), but I also have variables that have to meet a certain condition (e.g., if a patient had surgery in one or more of the rows, it should become a 'yes'; if not, it should become a 'no').
I have tried searching for the answer, but I am confused. I tried using plyr, but it seems that using this package is not recommended, as it becomes slow with very large datasets. I have found some information about dplyr, but I do not understand how I should use it.
So, for example, I have the following dataset (my apologies for how I present this, I am new to Stack Overflow):
**Patient_Id** /**Group** /**Age** /**Gender** /**surgery y/n** /**no of surgeries**
1 - 1 - 63 - F - no - 0
1 - 1 - 63 - F - no - 0
1 - 1 - 64 - F - yes - 1
2 - 0 - 60 - M - yes - 2
3 - 1 - 65 - M - no - 0
4 - 0 - 60 - F - no - 0
4 - 0 - 61 - F - yes - 1
4 - 0 - 62 - F - yes - 1
And I want to make a dataframe like this
**Patient_Id** /**Group** /**Age** /**Gender** /**surgery y/n** /**no of surgeries**
1 - 1 - 63,33 - F - yes - 1
2 - 0 - 60 - M - yes - 2
3 - 1 - 65 - M - no - 0
4 - 0 - 61 - F - yes - 2
Does anyone know what function would be best to use? Or how to start? Thank you in advance!
Data in dput format.
df1 <-
structure(list(Patient_Id = c(1, 1, 1, 2, 3, 4, 4, 4),
Group = c(1, 1, 1, 0, 1, 0, 0, 0), Age = c(63, 63, 64,
60, 65, 60, 61, 62), Gender = c("F", "F", "F", "M",
"M", "F", "F", "F"), `surgery y/n` = c("no", "no", "yes",
"yes", "no", "no", "yes", "yes"), `no of surgeries` = c(0L,
0L, 1L, 2L, 0L, 0L, 1L, 1L)), row.names = c(NA, -8L),
class = "data.frame")
df2 <-
structure(list(Patient_Id = c(1, 2, 3, 4),
Group = c(1, 0, 1, 0), Age = c("63,33",
"60", "65", "61"), Gender = c("F", "M",
"M", "F"), `surgery y/n` = c("yes", "yes",
"no", "yes"), `no of surgeries` = c(1, 2,
0, 2)), row.names = c(NA, -4L),
class = "data.frame")
The structure of my dataframe is as follows:
str(SMARTdata_50j_diagc_2016)
'data.frame': 458794 obs. of 20 variables:
$ Groep : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 2 2 2 ...
$ Ziekenhuis_Nr : Factor w/ 13 levels "1","10","11",..: 2 8 4 11 3 7 10 9 13 6 ...
$ Ziekenhuistype : Factor w/ 3 levels "0","1","2": 2 2 2 2 1 1 2 1 2 3 ...
$ Patient_Id : num 85550 101414 239946 291650 140558 ...
$ DBC_Id : num 181394 230887 448945 524873 251352 ...
$ Diagnose_Code : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Zorgtype_Code : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 1 2 ...
$ Lft_patient_openenDBC : num 50 80 66 60 67 64 54 71 70 76 ...
$ Geslacht : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 1 ...
$ MRI_nee_ja : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
$ MRI_Aantal : num 0 0 0 1 0 0 0 0 0 0 ...
$ Artroscopie_nee_jaz_jam : Factor w/ 3 levels "0","1","2": 1 1 1 3 1 1 1 1 1 1 ...
$ Artroscopie_aantal : num 0 0 0 1 0 0 0 0 0 0 ...
$ Jaar_openen_DBC : num 2016 2017 2018 2017 2017 ...
$ Mnd_openen_DBC : num 12 5 6 2 5 8 10 11 1 1 ...
$ Jaar_sluiten_DBC : num 2017 2017 2018 2017 2017 ...
$ Mnd_sluiten_DBC : num 4 9 10 4 9 12 2 3 4 5 ...
$ Aantal_overigeDBC_bijopenen: num 1 1 2 1 0 0 1 0 0 0 ...
$ open_DBC : 'yearmon' num Dec 2016 May 2017 Jun 2018 Feb 2017 ...
$ sluiten_DBC : 'yearmon' num Apr 2017 Sep 2017 Oct 2018 Apr 2017 ...
Your question is straightforward. One way to do it with the dplyr package would be:
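library(dplyr)

df1 %>%
  group_by(Patient_Id) %>%
  summarise(Group = first(Group),    # constant within a patient, so take the first value
            Age = mean(Age),         # average age over the patient's rows
            Gender = first(Gender),
            # total number of surgeries over all rows of the patient
            `no of surgeries` = sum(`no of surgeries`),
            # summarise() evaluates in order, so this uses the total computed above
            `surgery y/n` = ifelse(`no of surgeries` == 0, 'no', 'yes'))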
which gives,
# A tibble: 4 x 6
  Patient_Id Group   Age Gender `no of surgeries` `surgery y/n`
       <dbl> <dbl> <dbl> <chr>              <int> <chr>
1          1     1  63.3 F                      1 yes
2          2     0  60   M                      2 yes
3          3     1  65   M                      0 no
4          4     0  61   F                      2 yes
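The code above derives the yes/no column from the summed surgery count. If a yes/no variable has no matching count column, the condition from the question ("yes" if any row says "yes") can probably be written directly with any(). The following is only a sketch on the toy data from the question; the result name res is made up for illustration, and the columns are listed in the order of the desired output.

```r
library(dplyr)

# Same toy data as df1 in the question
df1 <- data.frame(
  Patient_Id = c(1, 1, 1, 2, 3, 4, 4, 4),
  Group = c(1, 1, 1, 0, 1, 0, 0, 0),
  Age = c(63, 63, 64, 60, 65, 60, 61, 62),
  Gender = c("F", "F", "F", "M", "M", "F", "F", "F"),
  `surgery y/n` = c("no", "no", "yes", "yes", "no", "no", "yes", "yes"),
  `no of surgeries` = c(0L, 0L, 1L, 2L, 0L, 0L, 1L, 1L),
  check.names = FALSE  # keep the column names with spaces as they are
)

# res is a hypothetical name for the one-row-per-patient result
res <- df1 %>%
  group_by(Patient_Id) %>%
  summarise(
    Group = first(Group),
    Age = mean(Age),
    Gender = first(Gender),
    # "yes" if at least one row of this patient says "yes", otherwise "no"
    `surgery y/n` = if (any(`surgery y/n` == "yes")) "yes" else "no",
    `no of surgeries` = sum(`no of surgeries`)
  )

print(res)
```

For the factor columns in the real dataset, the same idea should carry over by comparing against the level label, for example any(MRI_nee_ja == "1"), assuming level "1" means "yes" there.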