
How to gather columns in several steps in R without losing grouping

I need to convert a wide dataset to long format: 16 columns must collapse into 4. Each group of four columns holds information that belongs together, and that relationship must not be lost in the transformation.

The data come from a ranking task with four blocks, which gave me a data set in wide format where the information is divided into four groups, i.e. first_image, first_sex, first_score, second_image, second_sex, second_score, ...

I have tried various combinations of group_by() and gather(), but I'm nowhere close.

I've already read "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)", but I'm none the wiser, I'm afraid.

I've made some sample data showing what one participant's data looks like, and a sample of how I would like the result to look.


library(tidyverse)

sample_dat <- data.frame(subject_id = rep("sj1", 4),
                         first_pick = rep(1, 4),
                         first_image_pick = (c("a", "b", "c", "d")),
                         first_pick_neuro = rep("TD", 4),
                         first_pick_sex = rep("F", 4),
                         second_pick = rep(2, 4),
                         second_image_pick = (c("e", "f", "g", "h")),
                         second_pick_neuro = rep("TD", 4),
                         second_pick_sex = rep("M", 4),
                         third_pick = rep(3, 4),
                         third_image_pick = (c("i", "j", "k", "l")),
                         third_pick_neuro = rep("DS", 4),
                         third_pick_sex = rep("F", 4),
                         fourth_pick = rep(4, 4),
                         fourth_image_pick = (c("m", "n", "o", "p")),
                         fourth_pick_neuro = rep("DS", 4),
                         fourth_pick_sex = rep("M", 4))

Expected output:


final_data <- data.frame(subject_id = rep("sj1", 16),
                         image = c("a", "b", "c", "d",
                                   "e", "f", "g", "h",
                                   "i", "j", "k", "l",
                                   "m", "n", "o", "p"),
                         rank = rep(c(1, 2, 3, 4), each = 4), # from the numbers in the first_pick, second_pick etc. 
                         neuro = rep(c("TD", "DS"), each = 8),
                         sex = rep(c("F", "M", "F", "M"), each = 4))

So far I've tried this, but it only duplicates all the information:


sample_dat_long <- sample_dat %>%
  group_by(subject_id) %>%
  gather(Pick, Image,
         first_image_pick,
         second_image_pick,
         third_image_pick,
         fourth_image_pick)  

So essentially I don't want to lose the information that belongs to each image (pick, sex, neuro) when I gather my data.

Any help would be amazing!

We can do this with melt from data.table, which accepts multiple measure patterns to reshape from 'wide' to 'long' format. Here, the columns whose names contain the substrings 'image', 'neuro', and 'sex' are reshaped into separate columns to get the expected output:

library(data.table)
# Each pattern selects one set of columns; the matches are stacked in order
melt(setDT(sample_dat), measure.vars = patterns("image", "neuro", "sex"),
     value.name = c("image", "neuro", "sex"), variable.name = "rank")[,
     .(subject_id, rank, image, neuro, sex)]
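
To see why this works, it can help to check which columns each pattern picks up. The sketch below uses only base R's grep() on the column names from the question. It imitates the selection step, not melt itself, and the names col_names, pats, matched, and unmatched are made up for illustration.

```r
# Column names copied from the question's sample_dat
col_names <- c("subject_id",
               "first_pick",  "first_image_pick",  "first_pick_neuro",  "first_pick_sex",
               "second_pick", "second_image_pick", "second_pick_neuro", "second_pick_sex",
               "third_pick",  "third_image_pick",  "third_pick_neuro",  "third_pick_sex",
               "fourth_pick", "fourth_image_pick", "fourth_pick_neuro", "fourth_pick_sex")

# The same three patterns that are passed to patterns() above
pats <- c("image", "neuro", "sex")

# For each pattern, list the column names that contain it
matched <- lapply(setNames(pats, pats), function(p) grep(p, col_names, value = TRUE))
print(matched)

# Columns that no pattern matches
unmatched <- setdiff(col_names, unlist(matched))
print(unmatched)
```

The four plain *_pick columns match none of the patterns, so melt treats them as id columns when id.vars is not supplied. The rank column in the result is therefore the position of each match within its pattern (1 to 4), which agrees with the actual rank only because the blocks appear in rank order.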

I guess you can do it column by column, since you only want 4 columns in the end. First get the indices of the columns that should be stacked into the first new column (if I understood correctly):

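  # Columns 2, 6, 10, 14 are first_pick, second_pick, ... (the rank)
  ind1 <- seq(2, ncol(sample_dat), 4)
  # gather() returns key/value columns; keep only the stacked values
  column1 <- gather(sample_dat[, ind1])[2]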

Then repeat for the other 3 columns:

  # Columns 3, 7, 11, 15 are the *_image_pick columns
  ind2 <- seq(3, ncol(sample_dat), 4)
  column2 <- gather(sample_dat[, ind2])[2]

You could even handle all 4 columns in a for loop instead of doing it "manually", and then combine them back into a data frame afterward.
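
A minimal sketch of that loop, written in base R rather than with gather() so it runs without extra packages. It assumes the column layout from the question (one id column followed by blocks of four columns in the order pick, image, neuro, sex). The names new_names and long_dat are chosen here for illustration.

```r
# Self-contained copy of the question's sample data
sample_dat <- data.frame(
  subject_id = rep("sj1", 4),
  first_pick = rep(1, 4),  first_image_pick = c("a", "b", "c", "d"),
  first_pick_neuro = rep("TD", 4),  first_pick_sex = rep("F", 4),
  second_pick = rep(2, 4), second_image_pick = c("e", "f", "g", "h"),
  second_pick_neuro = rep("TD", 4), second_pick_sex = rep("M", 4),
  third_pick = rep(3, 4),  third_image_pick = c("i", "j", "k", "l"),
  third_pick_neuro = rep("DS", 4),  third_pick_sex = rep("F", 4),
  fourth_pick = rep(4, 4), fourth_image_pick = c("m", "n", "o", "p"),
  fourth_pick_neuro = rep("DS", 4), fourth_pick_sex = rep("M", 4),
  stringsAsFactors = FALSE)

# Assumed target names, in the order the columns appear within each block
new_names <- c("rank", "image", "neuro", "sex")

# Repeat the id once per block so it lines up with the stacked columns
long_dat <- data.frame(subject_id = rep(sample_dat$subject_id, times = 4),
                       stringsAsFactors = FALSE)

for (k in seq_along(new_names)) {
  # k-th column of every block: k + 1, k + 5, k + 9, k + 13
  ind <- seq(k + 1, ncol(sample_dat), by = 4)
  # Stack the selected columns end to end (factors converted to character first)
  stacked <- lapply(sample_dat[ind], function(x) if (is.factor(x)) as.character(x) else x)
  long_dat[[new_names[k]]] <- unlist(stacked, use.names = FALSE)
}

print(long_dat)
```

Because every set of columns is stacked in the same block order, the rows stay aligned: each image keeps the rank, neuro, and sex values from its own block.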

It is worth using well-structured column names of the form "<variable_chr>.<time_num>". The current names do not follow that pattern, but we can fix that in a second.

pfx <- c("first", "second", "third", "fourth")

names(sample_dat)[-1] <- sapply(names(sample_dat)[-1], function(x) {
  x <- gsub("_pick", "", x)
  if (lengths(strsplit(x, "_")) == 2)
    sub("(^.*)_(.*)", paste("\\2", which(pfx == sub("(^.*)_.+", "\\1", x)), sep="."), x)
  else
    paste0("rank.", which(pfx == x))
})

names(sample_dat)  # good names now
# [1] "subject_id" "rank.1"     "image.1"    "neuro.1"    "sex.1"      "rank.2"    
# [7] "image.2"    "neuro.2"    "sex.2"      "rank.3"     "image.3"    "neuro.3"   
# [13] "sex.3"      "rank.4"     "image.4"    "neuro.4"    "sex.4" 

After that, we can easily use reshape.

# subject_id is not unique, so supply one row name per output row (4 times x 4 rows)
reshape(sample_dat, idvar="subject_id", varying=2:17, direction="long", 
        new.row.names=seq_len(4 * nrow(sample_dat)))
#    subject_id time rank image neuro sex
# 1         sj1    1    1     a    TD   F
# 2         sj1    1    1     b    TD   F
# 3         sj1    1    1     c    TD   F
# 4         sj1    1    1     d    TD   F
# 5         sj1    2    2     e    TD   M
# 6         sj1    2    2     f    TD   M
# 7         sj1    2    2     g    TD   M
# 8         sj1    2    2     h    TD   M
# 9         sj1    3    3     i    DS   F
# 10        sj1    3    3     j    DS   F
# 11        sj1    3    3     k    DS   F
# 12        sj1    3    3     l    DS   F
# 13        sj1    4    4     m    DS   M
# 14        sj1    4    4     n    DS   M
# 15        sj1    4    4     o    DS   M
# 16        sj1    4    4     p    DS   M
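
The reason the renaming matters is that reshape splits each varying column name at the "." to work out the variable name and the time value. A small sketch on a made-up two-row data frame (the names wide, long, x, and y are hypothetical and not part of the data above):

```r
# Toy wide data: two variables (x, y) measured at two times (1, 2)
wide <- data.frame(id  = c("a", "b"),
                   x.1 = c(1, 2), y.1 = c("p", "q"),
                   x.2 = c(3, 4), y.2 = c("r", "s"),
                   stringsAsFactors = FALSE)

# With names of the form <variable>.<time>, reshape infers both parts itself
long <- reshape(wide, idvar = "id", varying = 2:5, direction = "long")

print(long)
```

Here id is unique, so reshape can build its own row names and new.row.names is not needed. In sample_dat every row shares the same subject_id, which is why the call above supplies the row names explicitly.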

Data

sample_dat <- structure(list(subject_id = structure(c(1L, 1L, 1L, 1L), .Label = "sj1", class = "factor"), 
    first_pick = c(1, 1, 1, 1), first_image_pick = structure(1:4, .Label = c("a", 
    "b", "c", "d"), class = "factor"), first_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), first_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), second_pick = c(2, 
    2, 2, 2), second_image_pick = structure(1:4, .Label = c("e", 
    "f", "g", "h"), class = "factor"), second_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), second_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor"), third_pick = c(3, 
    3, 3, 3), third_image_pick = structure(1:4, .Label = c("i", 
    "j", "k", "l"), class = "factor"), third_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), third_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), fourth_pick = c(4, 
    4, 4, 4), fourth_image_pick = structure(1:4, .Label = c("m", 
    "n", "o", "p"), class = "factor"), fourth_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), fourth_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))
