简体   繁体   中英

How to cast “long-ish” data to wide format in R, dropping some values of some wide-column variables?

My data is in a not-quite-wide format where most variables have their own columns but the last two ( point_type and point_value ) are long. Like this:

strain light hours point_type point_value
------ ----- ----- ---------- -----------
blank  25    17    log_od750  -0.9018948
blank  25    17    log_od750  -2.0778428
blank  25    17    log_od750  -1.6129440
blank  25    17    log_od750  -2.0076792
blank  25    17    log_od750  -2.8913722
blank  25    17    log_od750  -2.4292827

How can I extract a wide-format data frame with points of a few specific types? For example I want the columns strain , light , hours , od750 , and log_od750 , and to drop any other point_type s. I assume the easiest way is to subset like data <- subset(data, point_type %in% c('od750', 'log_od750')) first, then reshape?

I've tried various dcast formulas but don't really get what I'm doing. Do I need to melt it the rest of the way first? And if so how to I prevent duplicates from being aggregated on recasting?

I realize there are a lot of reshaping questions and answers out there! But I haven't been able to figure it out from them all morning and I'm not even sure which ones are relevant to my problem.

EDIT: better sample data.

set.seed(1)
data <- data.frame(
  strain=sample(paste0('strain', 1:10), 100, replace=TRUE),
  light=sample(100),
  hours=sample(100),
  point_type=sample(c('od750', 'log_od750', 'loess', 'loess_err',
                      'locfit', 'locfit_err'), 100, replace=TRUE),
  point_value=sample(100)
)

(The columns I want to reshape together will have matching hours values, I'm just not sure how to write a sample call that demonstrates that.)

EDIT: eipi10's answer almost works, but I think I need to additionally work around the inability to keep duplicate values in dcast by adding an ID column, as shown in several other questions:

But after reading + trying those examples I'm still not sure how to construct that column.

EDIT: Working solution based on eipi10's answer and this one :

cast_points <- function(df, ptype_regex) {
  # takes a data frame in "long-ish" format
  # (identifying columns + point_type, point_value),
  # and extracts a wide-format data frame
  # note: no aggregation is done, which makes sense if your data contains
  #       lots of replicate measurements you want to keep separate
  # note: this also works best when you have values for each chosen point_type
  #       for each combination of other columns. if not, you get lots of NAs
  idcols <- colnames(df)[!colnames(df) %in% c('point_type', 'point_value')]
  ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
  gbdots <- lapply(c(idcols, 'point_type'), as.symbol)
  dcform <- paste(paste(c(idcols, 'counter'), collapse='+'), '~', 'point_type')
  df <- df[df[['point_type']] %in% ptypes,]
  df <- df %>%
    group_by_(.dots=gbdots) %>%
    mutate(counter=1:n()) %>%
    dcast(formula=dcform, value.var="point_value", drop=TRUE)
  df[['counter']] <- NULL
  return(df)
}

head(data)
#   strain resistance light  light_bin hours point_type point_value
# 1  blank       <NA>    25 025-025 uE    17      od750      0.4058
# 2  blank       <NA>    25 025-025 uE    17      od750      0.1252
# 3  blank       <NA>    25 025-025 uE    17      od750      0.1993
# 4  blank       <NA>    25 025-025 uE    17      od750      0.1343
# 5  blank       <NA>    25 025-025 uE    17      od750      0.0555
# 6  blank       <NA>    25 025-025 uE    17      od750      0.0881
tail(data)
#        strain resistance light  light_bin hours point_type point_value
# 93516 strain1         Km    25 025-025 uE    17  log_od750  -0.9670579
# 93517 strain1         Km    25 025-025 uE    17  log_od750  -1.4605870
# 93540 strain1         Km    25 025-025 uE    17  log_od750  -1.8300846
# 93542 strain1         Km    25 025-025 uE    17  log_od750  -1.1779802
# 93554 strain1         Km    25 025-025 uE    17  log_od750  -2.0448469
# 93556 strain1         Km    25 025-025 uE    17  log_od750  -1.8413700

head(cast_points(data, 'log'))
#   strain resistance light  light_bin hours  log_od750
# 1  blank       <NA>    25 025-025 uE    17 -0.9018948
# 2  blank       <NA>    25 025-025 uE    17 -2.0778428
# 3  blank       <NA>    25 025-025 uE    17 -1.6129440
# 4  blank       <NA>    25 025-025 uE    17 -2.0076792
# 5  blank       <NA>    25 025-025 uE    17 -2.8913722
# 6  blank       <NA>    25 025-025 uE    17 -2.4292827

head(cast_points(data, 'od'))
#   strain resistance light  light_bin hours  log_od750  od750
# 1  blank       <NA>    25 025-025 uE    17 -0.9018948 0.4058
# 2  blank       <NA>    25 025-025 uE    17 -2.0778428 0.1252
# 3  blank       <NA>    25 025-025 uE    17 -1.6129440 0.1993
# 4  blank       <NA>    25 025-025 uE    17 -2.0076792 0.1343
# 5  blank       <NA>    25 025-025 uE    17 -2.8913722 0.0555
# 6  blank       <NA>    25 025-025 uE    17 -2.4292827 0.0881

EDIT: In case anyone else got this far, here are a couple other functions for manipulating the same kind of data:

remove_points <- function(df, ptype_regex) {
  # takes a data frame in "long-ish" format
  # (identifying columns + point_type, point_value),
  # and removes rows with point_types matching the regex
  # useful for removing points that you just extracted with cast_points
  ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
  df[!df[['point_type']] %in% ptypes,]
}

append_points <- function(df1_longish, df2_wide) {
  # takes one main data frame in "long-ish" format and a second in wide format
  # melts the wide one into long-ish and appends to the main one
  idcols <- colnames(df1_longish)[!colnames(df1_longish)
                                  %in% c('point_type', 'point_value')]
  df2_longish <- melt(df2, variable.name='point_type',
                              value.name='point_value', id.vars=idcols)
  rbind(df1_longish, df2_longish)
}

sac_points <- function(df, fn, ptype_regex) {
  # split-apply-combine for "long-ish" data
  misc <- drop_points(df, ptype_regex)
  df   <- cast_points(df, ptype_regex)
  df   <- fn(df)
  append_points(misc, df)
}

It looks like you've got two separate issues: You want to keep only some levels of point_type , and you then want the levels of point_type to become columns. If so, you can do this:

library(reshape2)

# Subset
data.keep = data[grep("od750|loc", data$point_type), ]

# Convert to wide format
data.keep.wide = dcast(data.keep, strain + light + hours ~ point_type, 
                       value.var="point_value") 

The first line of code will keep all rows for which point_type contains the text "od750" or "loc" (I just chose these for illustration). The second line converts the remaining levels of point_type to columns.

Here the first few lines of the result:

data.keep.wide

     strain light hours locfit locfit_err log_od750 od750
1   strain1    24    48     NA         60        NA    NA
2   strain1    67    74     16         NA        NA    NA
3   strain1    78    20      7         NA        NA    NA
4   strain1    83     5     NA         NA        58    NA
5   strain1    95    84     NA         NA        NA    47
6  strain10    13    53     NA         NA        35    NA
7  strain10    42    90     NA         NA        NA    78

Note that in each row only one of the new columns contains a value. Based on the information in your question, I'm assuming that your real data has a point_value for each level of point_type for each hour.

UPDATE: Your comments seem to indicate that your data can have multiple rows for each unique combination of strain, light, hours and point_type. If that's the case, and you want to keep all of them, then you can do the following after subsetting. The code below groups the data by all the relevant columns, then adds counter , which provides a unique "ID" to each row of a given group. Then we include counter on the LHS of dcast to keep all rows, without aggregation:

library(reshape2)
library(dplyr)

# Convert to wide format, keeping all rows
data.keep.wide = data.keep %>%
  group_by(strain, light, hours, point_type) %>%
  mutate(counter=1:n()) %>%
  dcast(strain + light + hours + counter ~ point_type, value.var="point_value")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM