简体   繁体   中英

Reshape data using dcast?

I don't know if using dcast() is the right way, but I want to reshape the following data.frame:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
   x y  z
1 p1 a 14
2 p1 b 14
3 p2 a 16

so that it looks like this one:

df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
   x a b  z
1 p1 1 1 14
2 p2 1 0 16

The variable y in df should be broken so that its elements are new variables, each dummy coded. All other variables (in this case just z ) are equal for each person (p1,p2 etc.). The only variable where a specific person p has different values is y .
The reason I want this is because I need to merge this dataset with other ones by variable x . Thing is, it needs to be one row per person (p1,p2 etc).

The following works, but seems cumbersome.

df2 <- df
df2$y <- as.numeric(y)
df$y2 <- as.numeric(df$y)

df2 <- dcast(df, x+z~y, value.var="y2")

df2
   x  z a  b
1 p1 14 1  2
2 p2 16 1 NA

This is almost a duplicate of a previous question , and the same basic answer I used there works again. No need for any external packages either.

aggregate(model.matrix(~ y - 1, data=df),df[c("x","z")],max)

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

To explain this, as it is a bit odd looking, the model.matrix call at its most basic returns a binary indicator variable for each unique value for each row of your data.frame, like so:

  ya yb
1  1  0
2  0  1
3  1  0

If you aggregate that intermediate result by your two id variables ( x and z ), you are then essentially acting on the initial data.frame of:

   x  z ya yb
1 p1 14  1  0
2 p1 14  0  1
3 p2 16  1  0

So if you take the max value of ya and yb within each combination of x and z , you basically do:

   x  z ya      yb
1 p1 14  1*max*  0
2 p1 14  0       1*max*

--collapse--

   x  z ya      yb
1 p1 14  1       1

...and repeat that for each unique x / z combination to give the final result:

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

Things get a bit crazy to generalise this to more columns, but it can be done, courtesy of this question eg:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c("14","15","16"))
intm <- model.matrix(~ y + z - 1, data=df,
                 contrasts.arg = sapply(df[2:3], contrasts, contrasts=FALSE))
aggregate(intm,df[c("x")],max)

   x ya yb z14 z15 z16
1 p1  1  1   1   1   0
2 p2  1  0   0   0   1

I'm not sure much of this you have to do but if you need a way to automate it, I wrote this little function that might help:

First run dcast:

new = dcast(df, x+z~y, value.var="y")

Load into your R environment:

 # args to be passed: 
 # df is your dataframe 
 # cols is a list of format c("colname1", "colname2", ... , "colnameN")
    binarizeCols = function(df, cols){
      for(i in cols){
        column = which(colnames(df) == i)
        truthRow = is.na(df[,column])
        for(j in 1:length(truthRow)){
          if(truthRow[j] == FALSE){
            df[j,column] = 1
          }else{
             df[j,column] = 0
           }
        }
      }
      return(df)
    }

then run:

new = binarizeCols(new, c("a", "b"))

and you get:

     x  z  a  b
   1 p1 14 1  1 
   2 p2 16 1  0

not as fast as using _apply() but there's no hardcoding, you can enter any colnames you want (maybe you want to skip one in the middle?) and you dont create a new instance of your df. note: I use "=" instead of "<-" because I thought it was being phased out but they can be replaced if need be.

df <- data.frame(x=c("p1","p1","p2","p3"),
                 y=c("a","b","a","c"),
                 z=c(14,14,16,17))  # wanted larger test case.
new <- dcast(df, x+z~y, value.var="y")
new[3:5] <- sapply(lapply(new[3:5], '%in%', unique(df$y) ), as.numeric)
new
   x  z a b c
1 p1 14 1 1 0
2 p2 16 1 0 0
3 p3 17 0 0 1

First check for containment in a vector that summarizes the possible values to create columns of logical values. Then 'dummify' by taking as.numeric of those logical values.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM