How to adjust a horizontal row to become multiple rows across a large dataset

We have a large dataset which we would like to edit and analyse, but before we can begin we need to reshape the data into a more functional format for statistical analysis.

Incorrect format dataframe:
library(tidyverse)
data <-
 tribble(~id, ~date, ~start, ~end, ~start, ~end, ~start, ~end,
         1001, "01/07/2019", "04:00", "08:00", "10:00", "15:00", "16:00", "20:00",
         1001, "02/07/2019", "04:30", "05:30", "09:00", "14:00", "17:00", "21:00",
         1009, "05/07/2019", "03:00", "05:00", "07:00", "14:00", "15:00", "19:00",
         1009, "07/07/2019", "03:30", "04:30", "08:20", "15:20", "16:30", "20:30") 

Correct format dataframe:
# id date start end
# 1001 01/07/2019 04:00 08:00
# 1001 01/07/2019 10:00 15:00
# 1001 01/07/2019 16:00 20:00
# 1001 02/07/2019 04:30 05:30
# 1001 02/07/2019 09:00 14:00
# 1001 02/07/2019 17:00 21:00
# 1009 05/07/2019 03:00 05:00
# 1009 05/07/2019 07:00 14:00
# 1009 05/07/2019 15:00 19:00
# 1009 07/07/2019 03:30 04:30
# 1009 07/07/2019 08:20 15:20
# 1009 07/07/2019 16:30 20:30

I can manipulate my data manually, but I've been unable to automate the process. The actual dataset has 32 columns and 10,000 rows. Edit: I've tried to concatenate id and date to every value and then sort, but I made mistakes with this method.

Next time it would be great if you could post a reproducible example of your data (like the one in my code below).

It looks like what you want to do is to turn your data from a wide format into a long format. The duplicated column names are causing some trouble, but the code below should do the trick. You will have to install the tidyverse package for this:

library(tidyverse)

# reproducible example; note the duplicated start/end column names
data <-
  tribble(~id, ~date, ~start, ~end, ~start, ~end, ~start, ~end,
          1001, "01/07/2019", "04:00", "08:00", "10:00", "15:00", "16:00", "20:00",
          1001, "02/07/2019", "04:30", "05:30", "09:00", "14:00", "17:00", "21:00",
          1009, "05/07/2019", "03:00", "05:00", "07:00", "14:00", "15:00", "19:00",
          1009, "07/07/2019", "03:30", "04:30", "08:20", "15:20", "16:30", "20:30")

# make column names unique by appending each column's position (start_3, end_4, ...)
names(data) <-
  ifelse(names(data) %in% c("start", "end"),
         paste0(names(data), "_", seq_along(names(data))),
         names(data))

# turn data into long format
data %>%
  gather(key, value, -id, -date) %>%
  arrange(id, date) %>%
  # get rid of the column suffixes
  mutate(key = str_replace_all(key, pattern = c("_\\d+" = ""))) %>%
  # number the observations within each id/date/key so spread() gets unique rows
  group_by(id, date, key) %>%
  mutate(obs_id = row_number()) %>%
  spread(key, value) %>%
  ungroup() %>%
  select(id, date, start, end)
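
As a possible alternative, the same reshape can be sketched with `pivot_longer()`, which the tidyr documentation recommends over the superseded `gather()`/`spread()` pair. This assumes tidyr 1.0.0 or later. The helper names `is_time`, `time_names`, `pair` and `long` are only illustrative and are not part of the original answer. The idea is to give each start/end pair a shared suffix (`start_1`, `end_1`, `start_2`, ...) so that the `.value` sentinel can split every column name into the output column and a pair number:

```r
library(tidyverse)

data <-
  tribble(~id, ~date, ~start, ~end, ~start, ~end, ~start, ~end,
          1001, "01/07/2019", "04:00", "08:00", "10:00", "15:00", "16:00", "20:00",
          1001, "02/07/2019", "04:30", "05:30", "09:00", "14:00", "17:00", "21:00",
          1009, "05/07/2019", "03:00", "05:00", "07:00", "14:00", "15:00", "19:00",
          1009, "07/07/2019", "03:30", "04:30", "08:20", "15:20", "16:30", "20:30")

# Number the n-th occurrence of each duplicated name, so that the first
# start/end pair becomes start_1/end_1, the second start_2/end_2, and so on.
is_time <- names(data) %in% c("start", "end")
time_names <- names(data)[is_time]
pair <- ave(seq_along(time_names), time_names, FUN = seq_along)
names(data)[is_time] <- paste0(time_names, "_", pair)

# ".value" sends the part before "_" to the output column names (start, end);
# the part after "_" goes into a temporary "pair" column.
long <- data %>%
  pivot_longer(cols = -c(id, date),
               names_to = c(".value", "pair"),
               names_sep = "_") %>%
  select(id, date, start, end)

long
```

Because the suffixes are derived from the column names rather than typed by hand, the same sketch should carry over to the full 32-column dataset, provided its time columns are all named `start` and `end` as in the example.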
