
Reshaping data using R or Excel

I have a dataset that contains many rows and 28 columns.

I need one row per unique combination of the subject ID and coc# columns, with the data from the rows that would otherwise be dropped placed into extra columns. I might not be explaining this very well, so here is an example:

ID  DOB         address name            date seen   txdone  coc#
1   1/08/1997   4blelan bob sager   19/05/2002  1125    45555
1   1/08/1997   4blelan bob sager   19/05/2002  1200    45555
1   1/08/1997   4blelan bob sager   20/06/2003  2000    46666
1   1/08/1997   4blelan bob sager   20/06/2003  1222    46666
2   5/09/1956   55lala  Jim reads   19/05/2002  1125    55544
2   5/09/1956   55lala  Jim reads   19/05/2002  1111    55544
2   5/09/1956   55lala  Jim reads   1/06/2002   1111    55544
2   5/09/1956   55lala  Jim reads   2/07/2002   1353    56678

I would like it transformed into this:

ID  DOB         address name        dateseen1   txdone1 coc#1   dateseen2   txdone2 coc#2   dateseen3   txdone3 coc#3
1   1/08/1997   4blelan bob sager   19/05/2002  1125    45555   19/05/2002  1200    45555           
1   1/08/1997   4blelan bob sager   20/06/2003  2000    46666   20/06/2003  1222    46666           
2   5/09/1956   55lala  Jim reads   19/05/2002  1125    55544   19/05/2002  1111    55544   1/06/2002   1111    55544
2   5/09/1956   55lala  Jim reads   2/07/2002   1353    56678

The reason for this is so I can search for 1125 in txdone and also get the other work that was carried out under that coc# on the same line. Looking at it now, I wouldn't even need multiple coc# columns, just the one, but you get the idea (maybe).

I am very open to doing things differently if I am going about this backwards. However, I am limited to using R and Excel.

In R, the package reshape2 should do the job. Try:

library(reshape2)
melt(your_data_frame, id.vars = c("ID", "DOB", "address", "name"))

(You can play around with id.vars and measure.vars to get the exact reshaping you want.)
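To make the effect of id.vars concrete, here is a minimal sketch of what melt() returns. The data frame `toy` is a hypothetical two-row sample, not the asker's data, and it assumes reshape2 is installed.

```r
library(reshape2)

# Hypothetical two-row sample with the same column roles as the question
toy <- data.frame(
  ID = c(1, 1),
  name = c("bob sager", "bob sager"),
  date.seen = c("19/05/2002", "20/06/2003"),
  txdone = c(1125, 2000),
  stringsAsFactors = FALSE
)

# id.vars stay as columns; every other column is stacked
# into a "variable"/"value" pair, one row per original cell
long <- melt(toy, id.vars = c("ID", "name"))
print(long)
```

Columns not listed in id.vars are treated as measure variables by default, so moving a name in or out of id.vars changes which columns get stacked.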

You first need a counter that numbers the rows within each combination of the ID variables, so that every row is uniquely identified. Here's a solution:

library(splitstackshape) ## For `getanID()`
library(reshape2)        ## For `melt()` and `dcast()`

idvars <- c("ID", "DOB", "address", "name", "coc")
mydf2 <- getanID(mydf, idvars)
dfL <- melt(mydf2, id.vars=c(idvars, ".id"))
dcast(dfL, ID + DOB + address + name + coc ~ variable + .id)
#   ID       DOB address      name   coc date.seen_1 date.seen_2 date.seen_3 txdone_1 txdone_2 txdone_3
# 1  1 1/08/1997 4blelan bob sager 45555  19/05/2002  19/05/2002        <NA>     1125     1200     <NA>
# 2  1 1/08/1997 4blelan bob sager 46666  20/06/2003  20/06/2003        <NA>     2000     1222     <NA>
# 3  2 5/09/1956  55lala Jim reads 55544  19/05/2002  19/05/2002   1/06/2002     1125     1111     1111
# 4  2 5/09/1956  55lala Jim reads 56678   2/07/2002        <NA>        <NA>     1353     <NA>     <NA>

You can rearrange the column orders later if you need to.


Alternatively, without melting to a long format first, once you have created "mydf2" you can use reshape() from base R (as a bonus, the columns come out in the order you want).

reshape(mydf2, direction = "wide", idvar=idvars, timevar=".id")
#   ID       DOB address      name   coc date.seen.1 txdone.1 date.seen.2 txdone.2 date.seen.3 txdone.3
# 1  1 1/08/1997 4blelan bob sager 45555  19/05/2002     1125  19/05/2002     1200        <NA>       NA
# 3  1 1/08/1997 4blelan bob sager 46666  20/06/2003     2000  20/06/2003     1222        <NA>       NA
# 5  2 5/09/1956  55lala Jim reads 55544  19/05/2002     1125  19/05/2002     1111   1/06/2002     1111
# 8  2 5/09/1956  55lala Jim reads 56678   2/07/2002     1353        <NA>       NA        <NA>       NA
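Once the data is in this wide shape, the lookup described in the question (find 1125 in txdone and see the other work done under the same coc) could be sketched as below. The data frame `wide` is a hand-typed, trimmed stand-in for the reshape() result, and the names `tx_cols` and `hits` are only illustrative.

```r
# Hypothetical wide result, shaped like the reshape() output above
wide <- data.frame(
  ID = c(1, 1, 2, 2),
  coc = c(45555, 46666, 55544, 56678),
  txdone.1 = c(1125, 2000, 1125, 1353),
  txdone.2 = c(1200, 1222, 1111, NA),
  txdone.3 = c(NA, NA, 1111, NA)
)

# Find every txdone.* column, then keep the rows where any of them equals 1125;
# na.rm = TRUE makes the empty (NA) slots count as "no match"
tx_cols <- grep("^txdone\\.", names(wide), value = TRUE)
hits <- wide[rowSums(wide[tx_cols] == 1125, na.rm = TRUE) > 0, ]
print(hits)
```

Matching on the column-name prefix means the filter keeps working if a later reshape produces more txdone columns.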

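This is based on mydf being defined as follows (the "coc#" column is named "coc" here because read.table() treats "#" as a comment character, and "date seen" is read in as date.seen):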

mydf <- read.table(text = 'ID  DOB         address name            "date seen"   txdone  coc
1   1/08/1997   4blelan "bob sager"   19/05/2002  1125    45555
1   1/08/1997   4blelan "bob sager"   19/05/2002  1200    45555
1   1/08/1997   4blelan "bob sager"   20/06/2003  2000    46666
1   1/08/1997   4blelan "bob sager"   20/06/2003  1222    46666
2   5/09/1956   55lala  "Jim reads"   19/05/2002  1125    55544
2   5/09/1956   55lala  "Jim reads"   19/05/2002  1111    55544
2   5/09/1956   55lala  "Jim reads"   1/06/2002   1111    55544
2   5/09/1956   55lala  "Jim reads"   2/07/2002   1353    56678', header = TRUE)

If you don't want to install "splitstackshape" just for getanID (I promise I won't be offended), you can generate your .id variable manually as follows (which is essentially what getanID does anyway):

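## Uses `idvars` as defined earlier; `mydf` then plays the role of `mydf2`
X <- do.call(paste, mydf[idvars])
mydf$.id <- ave(seq_along(X), X, FUN = seq_along)  ## integer counter within each group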
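Putting the manual counter together with reshape(), a base-R-only version of the whole pipeline might look like the following sketch. It uses a reduced copy of the sample data (DOB, address and name are left out for brevity, so `idvars` is shortened to ID and coc), and it computes the counter over row positions so that `.id` comes out as an integer.

```r
# Reduced, hand-typed copy of the sample data (assumption: only the
# columns needed to show the technique are kept)
mydf <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2),
  date.seen = c("19/05/2002", "19/05/2002", "20/06/2003", "20/06/2003",
                "19/05/2002", "19/05/2002", "1/06/2002", "2/07/2002"),
  txdone = c(1125, 1200, 2000, 1222, 1125, 1111, 1111, 1353),
  coc = c(45555, 45555, 46666, 46666, 55544, 55544, 55544, 56678),
  stringsAsFactors = FALSE
)
idvars <- c("ID", "coc")

# Number the rows within each ID/coc group: 1, 2, 3, ...
key <- do.call(paste, mydf[idvars])
mydf$.id <- ave(seq_along(key), key, FUN = seq_along)

# Spread date.seen and txdone into one set of columns per counter value
wide <- reshape(mydf, direction = "wide", idvar = idvars, timevar = ".id")
print(wide)
```

Groups with fewer rows than the largest group are padded with NA in the extra columns, as in the outputs shown earlier.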

