Reshaping data in R with time values in the column names

Question

I have a data frame which looks like this (simplified):

     data1.time1 data1.time2 data2.time1 data2.time2 data3.time1 group
 1          1.53        2.01        6.49        5.22        3.46    A
 ...
 24         2.12        3.14        4.96        4.89        3.81    C

where there are actually columns dataK.timeT for K in 1..27 and T in some (but maybe not all) of 1..8.

I would like to rearrange the data into K data frames so that I can plot, for each K, the summary data (for now let's say mean and mean ± standard deviation) for each of the three groups A, B, and C. That is, I want 27 graphs with three lines per graph, plus marks for the deviations.

Once I rearrange the data it should be easy enough to collapse by group, compute summary statistics, etc. But I'm not really sure how to get the data into this form. I looked at the reshape package, which suggests melting the data into a key-value format and rearranging from there, but it doesn't seem to support column names that encode the T values, as mine do.

Is there a good way to do this? I'm quite willing to use something other than R, since I can just import the results into R after transforming.

Answer 1

After creating fake data with a structure similar to yours, we convert from wide to long format, making a "tidy" data frame that is ready for plotting with ggplot2.

library(reshape2)
library(ggplot2)
library(dplyr)

Create fake data

set.seed(194)
dat = data.frame(replicate(27*8, cumsum(rnorm(24*3))))

names(dat) = paste0(rep(paste0("data",1:27), each=8), ".", rep(paste0("time",1:8), 27))

dat$group = rep(LETTERS[1:3], each=24)

Remove some columns so that the number of time points differs between data sources:

dat = dat[ , -c(2,4,9,43,56,78,100:103,115:116,134:136,202,205)]

Reshape from wide to long format

datl = melt(dat, id.var="group")
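To see concretely what the wide-to-long step produces, the following base R sketch builds the same group/variable/value layout by hand. The toy frame `wide` and its three value columns are made up for illustration (the numbers are borrowed from the sample rows in the question), and unlike melt() this leaves `variable` as a character vector rather than a factor.

```r
# Toy wide data frame (hypothetical; two rows, three measurement columns)
wide <- data.frame(
  data1.time1 = c(1.53, 2.12),
  data1.time2 = c(2.01, 3.14),
  data2.time1 = c(6.49, 4.96),
  group       = c("A", "C")
)

# Every column except the id column holds measurements
value_cols <- setdiff(names(wide), "group")

# Stack the measurement columns on top of each other,
# repeating the id column once per stacked column
long <- data.frame(
  group    = rep(wide$group, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(wide)),
  value    = unlist(wide[value_cols], use.names = FALSE)
)

print(long)
```

Each original cell becomes one row, so the result has nrow(wide) times length(value_cols) rows, and missing time columns simply contribute no rows.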

Split data source and time point into separate columns:

datl$source = gsub("(.*)\\..*","\\1", datl$variable)
datl$time = as.numeric(gsub(".*time(.*)","\\1", datl$variable))

# Order data frame names by number (rather than alphabetically)
datl$source = factor(datl$source, levels=paste0("data",1:length(unique(datl$source))))
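As a quick check of the two gsub() patterns and the factor ordering above, here is a small sketch on three made-up column names (the vector `variable` and the names `src` and `src_f` are illustrative, not part of the answer's data).

```r
# Hypothetical column names in the dataK.timeT form
variable <- c("data2.time1", "data10.time3", "data1.time8")

# Keep everything before the dot: the data source
src <- gsub("(.*)\\..*", "\\1", variable)

# Keep everything after "time": the time point, as a number
time <- as.numeric(gsub(".*time(.*)", "\\1", variable))

# Explicit levels give numeric ordering of the sources;
# default (alphabetical) levels would put data10 before data2
src_f <- factor(src, levels = paste0("data", 1:10))

print(src)
print(time)
print(levels(droplevels(src_f)))
```

The explicit levels matter mostly for plotting, since facets are drawn in level order.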

Plot the data using ggplot2

# Helper function for plotting standard deviation
sdFnc = function(x) {
  vals = c(mean(x) - sd(x), mean(x) + sd(x))
  names(vals) = c("ymin", "ymax")
  vals
}
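For reference, this is what the helper returns on a tiny input. The vector c(1, 2, 3) is just an illustrative choice: its mean is 2 and its sample standard deviation is 1, so the bounds are easy to check by hand.

```r
# Same helper as above, repeated so this snippet runs on its own
sdFnc = function(x) {
  vals = c(mean(x) - sd(x), mean(x) + sd(x))
  names(vals) = c("ymin", "ymax")
  vals
}

# mean = 2, sd = 1, so ymin = 1 and ymax = 3
res <- sdFnc(c(1, 2, 3))
print(res)
```

The names ymin and ymax are what the errorbar geom looks for when the function is passed as fun.data.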

pd = position_dodge(0.7)

# Note: ggplot2 >= 3.3.0 uses `fun`; older versions call this argument `fun.y`
ggplot(datl, aes(time, value, group=group, color=group)) +
  stat_summary(fun=mean, geom="line", position=pd) +
  stat_summary(fun.data=sdFnc, geom="errorbar", width=0.4, position=pd) +
  stat_summary(fun=mean, geom="point", position=pd) +
  facet_wrap(~source, ncol=3) +
  theme_bw()
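If you also want the summary numbers themselves rather than only the plot, a base R sketch along these lines computes the mean and standard deviation per group, source, and time from a long data frame. The toy frame `long` below is randomly generated for illustration and only mimics the columns of datl; the names `means`, `sds`, and `stats` are likewise assumptions.

```r
set.seed(1)

# Toy long-format data: 5 replicates for each group/source/time combination
long <- expand.grid(rep = 1:5,
                    group = c("A", "B", "C"),
                    source = c("data1", "data2"),
                    time = 1:3)
long$value <- rnorm(nrow(long))

# Mean and standard deviation per group/source/time
means <- aggregate(value ~ group + source + time, data = long, FUN = mean)
sds   <- aggregate(value ~ group + source + time, data = long, FUN = sd)

# Join the two summaries; the shared "value" column gets the suffixes
stats <- merge(means, sds, by = c("group", "source", "time"),
               suffixes = c(".mean", ".sd"))

# Error-bar bounds: mean minus/plus one standard deviation
stats$ymin <- stats$value.mean - stats$value.sd
stats$ymax <- stats$value.mean + stats$value.sd

print(head(stats))
```

A table like this could be passed to ggplot with geom_line() and geom_errorbar() directly, instead of letting stat_summary() compute the statistics.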

Original (unnecessarily complicated) reshaping code. (Note: this code no longer works with the updated fake data set, because the number of time columns is no longer uniform.)

# Convert data source from wide to long
datl = data.frame()
for (i in seq(1,27*8,8)) {

  tmp.dat = dat[, c(i:(i+7),grep("group",names(dat)))]
  tmp.dat$source = gsub("(.*)\\..*", "\\1", names(tmp.dat)[1])
  names(tmp.dat)[1:8] = 1:8

  #datl = rbind(datl, tmp.dat)
  datl = bind_rows(datl, tmp.dat)  # Updated based on comment
}

datl$source = factor(datl$source, levels=paste0("data",1:27))

# Convert time from wide to long
datl = melt(datl, id.var = c("source","group"), variable.name="time")

Answer 2

You could do something like this with dplyr:

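library(dplyr)
K <- 27
for (i in 1:K) {
  my.data.ind <- paste0("^data", i, "\\.|^group$") ## regex: columns of source i, plus group
  one.month <- data %>%
    select(matches(my.data.ind)) %>% ## grab cols that match (contains() does not take a regex)
    group_by(group) %>% ## group by your group
    summarise(across(everything(), list(mean = mean, sd = sd))) ## mean and sd of each col within each group
  ## use or store one.month here; it is overwritten on the next iteration
}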

That should leave you, for each K, with a data frame that has one row per group (three rows) and summary columns for each time point T.
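Since the question asks for K separate data frames, another option is to reshape to long format once and then split by source. The following base R sketch assumes a long frame with group, source, time, and value columns; the toy frame `long` and the names `by_source` and `means_by_source` are made up for illustration.

```r
# Toy long-format data: 2 sources, 2 time points, 3 groups
long <- data.frame(
  group  = rep(c("A", "B", "C"), times = 4),
  source = rep(c("data1", "data2"), each = 6),
  time   = rep(rep(1:2, each = 3), times = 2),
  value  = seq_len(12)
)

# One data frame per source, returned as a named list
by_source <- split(long, long$source)

# Per-source table of group means at each time point
means_by_source <- lapply(by_source, function(d) {
  aggregate(value ~ group + time, data = d, FUN = mean)
})

print(names(by_source))
print(means_by_source[["data1"]])
```

Keeping the pieces in a named list avoids creating 27 separate variables and makes it easy to loop over them for plotting.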
