简体   繁体   English

将数据帧转换为 R 中的一列长格式

[英]Convert dataframe into long format one column in R

I have data as given in input section (dput below), need to convert to output with all values of two rows in one long column.我有输入部分(下面的 dput)中给出的数据,需要转换为输出,其中包含一长列中两行的所有值。 I tried using transpose but cells were getting trimmed.我尝试使用转置,但细胞被修剪。

I don't want to hardcode since in future I might have data in 3 or 4 rows in a similar way.我不想硬编码,因为将来我可能会以类似的方式将数据分成 3 或 4 行。

PS - I also tried pivot_longer but it didnt help PS - 我也试过 pivot_longer 但它没有帮助

    structure(list(Header = c("Sat 12/3   \n358a-947a\n1017a-229p HRS   10.02", 
"Sat 12/10  \n559a-1106a\n1134a-227p HRS         8.00"), X = c("Sun 12/4   ", 
"Sun 12/11  "), X.1 = c("Mon 12/5   \n548a-1121a\n1149a-618p\n650p-845p HRS   13.95", 
"Mon 12/12  \n500a-1121a\n1151a-547p\n616p-830p HRS   14.53"), 
    X.2 = c("Tue 12/6   \n359a-1120a\n1150a-400p HRS   11.53", 
    "Tue 12/13  \n548a-1120a\n1148a-449p HRS   10.54"), X.3 = c("Wed 12/7   \n548a-1119a\n1149a-515p HRS   10.95", 
    "Wed 12/14  \n429a-1120a\n1150a-432p HRS   11.56"), X.4 = c("Thu 12/8   \n549a-1120a\n1149a-447p HRS   10.48", 
    "Thu 12/15  \n429a-1121a\n1152a-431p HRS   11.52"), X.5 = c("Fri 12/9   \n548a-1120a\n1148a-218p HRS         8.03", 
    "Fri 12/16  \n430a-1120a\n1150a-432p HRS   11.55")), class = "data.frame", row.names = c(NA, 
-2L))

My try (with a little help)我的尝试(有一点帮助)

pivot_longer(df, cols = c(1:7)) %>%    
  select(value) %>% 
  mutate(value=str_replace(value,"HRS","")) %>% 
  separate(.,value,into=c("day","entry1","entry2","entry3"),sep="\n") %>% 
  separate(.,entry1,into=c("time_in1","time_out1"),sep="-") %>% 
  separate(.,entry2,into=c("time_in2","time_out2"),sep="-") %>% 
  separate(.,time_out2,into=c("time_out2","duration1"),remove = FALSE,sep="    ",extra = "merge") %>% 
  separate(.,entry3,into=c("time_in3","time_out3"),sep="-") %>% 
  separate(.,time_out3,into=c("time_out3","duration2"),remove = FALSE,sep="    ") %>%
  mutate(duration=coalesce(duration1,duration2)) %>% 
  select(day, duration, time_in1,time_out1,time_in2,time_out2,time_in3,time_out3) %>% 
  separate(.,day,into=c("date","day"),extra="merge") %>% 
  mutate(day=mdy(paste0(day,"2021")),
         duration=str_trim(duration))

预期输出 所需输出

Approach方法

The key was tidyr::separate_rows() , which not only separates the cell by "\\n" but also splits the components into rows rather than columns.关键是tidyr::separate_rows() ,它不仅用"\\n"分隔单元格,而且还将组件分成而不是列。

Here, it is much better to split into rows than into columns .在这里,拆分成行比拆分成列好得多 Suppose that most cells have 2 or 3 entries separated by "\\n" ;假设大多数单元格有 2 或 3 个条目,由"\\n"分隔; but there is a "rogue" cell, with an unusually large number (say 9) of entries, generated by someone who repeatedly clocked in and out throughout the day.但是有一个“流氓”单元格,其中包含异常多的条目(比如 9 个),是由一整天反复打卡打卡的人生成的。

While splitting into columns would create arbitrarily many time_in* | time_out*虽然拆分成列会创建任意多个time_in* | time_out* time_in* | time_out* columns, which remain empty ( NA ) in all rows except the "rogue" time_in* | time_out*列,在除“流氓”之外的所有行中保持为空( NA

   date  day        duration time_in1 time_out1 time_in2 time_out2 time_in3 time_out3 time_in4 time_out4 time_in5 time_out5 time_in6 time_out6 time_in7 time_out7 time_in8 time_out8 time_in9 time_out9
   <chr> <date>     <chr>    <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>
#  ...   ...        ...      ...      ...       ...      ...       ...      ...       NA       NA        NA       NA        NA       NA        NA       NA        NA       NA        NA       NA        

splitting into rows will maintain a tame (and stable) columnar structure分成行将保持温和(和稳定)的柱状结构

   date       day      duration time_in time_out
   <date>     <chr>       <dbl> <chr>   <chr>   
#  ...        ...      ...      ...     ...
#  ...        ...      ...      ...     ...
#  ...        ...      ...      ...     ...

without any "extraneous" columns (or rows).没有任何“无关”的列(或行)。

Solution解决方案

Given your sample data df鉴于您的样本数据df

df <- structure(list(Header = c("Sat 12/3   \n358a-947a\n1017a-229p HRS   10.02", "Sat 12/10  \n559a-1106a\n1134a-227p HRS         8.00"),
                     X = c("Sun 12/4   ", "Sun 12/11  "),
                     X.1 = c("Mon 12/5   \n548a-1121a\n1149a-618p\n650p-845p HRS   13.95", "Mon 12/12  \n500a-1121a\n1151a-547p\n616p-830p HRS   14.53"), 
                     X.2 = c("Tue 12/6   \n359a-1120a\n1150a-400p HRS   11.53", "Tue 12/13  \n548a-1120a\n1148a-449p HRS   10.54"),
                     X.3 = c("Wed 12/7   \n548a-1119a\n1149a-515p HRS   10.95", "Wed 12/14  \n429a-1120a\n1150a-432p HRS   11.56"),
                     X.4 = c("Thu 12/8   \n549a-1120a\n1149a-447p HRS   10.48", "Thu 12/15  \n429a-1121a\n1152a-431p HRS   11.52"),
                     X.5 = c("Fri 12/9   \n548a-1120a\n1148a-218p HRS         8.03", "Fri 12/16  \n430a-1120a\n1150a-432p HRS   11.55")),
                class = "data.frame", row.names = c(NA, -2L))

the following workflow以下工作流程

library(tidyverse)
library(stringr)


# ...
# Code to generate 'df'.
# ...


year_observed <- 2016


results <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(!id, names_to = "column") %>%
  separate(value, into = c("date", "entries"), sep = "\n", fill = "right", extra = "merge", remove = TRUE) %>%
  separate(entries, into = c("times", "duration"), sep = "HRS", fill = "right", extra = "warn", remove = TRUE) %>%
  mutate(across(date:duration, trimws),
         date = as.Date(paste(str_extract(date, "\\d{1,2}/\\d{1,2}$"), year_observed, sep = "/"), format = "%m/%d/%Y"),
         duration = as.numeric(duration),
         duration = if_else(is.na(duration), 0, duration),
         day = format(date, format = "%a")) %>%
  separate_rows(times, sep = "\n") %>%
  separate(times, into = c("time_in", "time_out"), sep = "-", fill = "warn", extra = "warn", remove = TRUE) %>%
  # ...Further Transformations... %>%
  select(id, date, day, duration, time_in, time_out)


# View results.
results

should yield results like应该产生类似的results

# A tibble: 28 x 6
      id date       day   duration time_in time_out
   <int> <date>     <chr>    <dbl> <chr>   <chr>   
 1     1 2016-12-03 Sat       10.0 358a    947a    
 2     1 2016-12-03 Sat       10.0 1017a   229p    
 3     1 2016-12-04 Sun        0   NA      NA      
 4     1 2016-12-05 Mon       14.0 548a    1121a   
 5     1 2016-12-05 Mon       14.0 1149a   618p    
 6     1 2016-12-05 Mon       14.0 650p    845p    
 7     1 2016-12-06 Tue       11.5 359a    1120a   
 8     1 2016-12-06 Tue       11.5 1150a   400p    
 9     1 2016-12-07 Wed       11.0 548a    1119a   
10     1 2016-12-07 Wed       11.0 1149a   515p    
# ... with 18 more rows

where id identifies (by row number) the original record in df .其中id标识(按行号) df的原始记录。

To pivot into your newly specified output , simply execute this code, or append it to the existing workflow:要转入新指定的 output ,只需执行此代码,或将其附加到现有工作流:

wide_results <- results %>%
  group_by(id, date) %>% mutate(entry = row_number()) %>% ungroup() %>%
  pivot_wider(id_cols = c(date, day, duration), names_from = entry, names_glue = "{.value}_{entry}", values_from = c(time_in, time_out)) %>%
  # Select so as to alternate between 'time_in_*' and 'time_out_*'.
  select(order(as.numeric(str_extract(colnames(.), "\\d+$")), str_extract(colnames(.), "^time_(in|out)"), na.last = FALSE))


# View results.
wide_results

You should obtain wide_results like:您应该获得wide_results如:

# A tibble: 14 x 9
   date       day   duration time_in_1 time_out_1 time_in_2 time_out_2 time_in_3 time_out_3
   <date>     <chr>    <dbl> <chr>     <chr>      <chr>     <chr>      <chr>     <chr>     
 1 2016-12-03 Sat      10.0  358a      947a       1017a     229p       NA        NA        
 2 2016-12-04 Sun       0    NA        NA         NA        NA         NA        NA        
 3 2016-12-05 Mon      14.0  548a      1121a      1149a     618p       650p      845p      
 4 2016-12-06 Tue      11.5  359a      1120a      1150a     400p       NA        NA        
 5 2016-12-07 Wed      11.0  548a      1119a      1149a     515p       NA        NA        
 6 2016-12-08 Thu      10.5  549a      1120a      1149a     447p       NA        NA        
 7 2016-12-09 Fri       8.03 548a      1120a      1148a     218p       NA        NA        
 8 2016-12-10 Sat       8    559a      1106a      1134a     227p       NA        NA        
 9 2016-12-11 Sun       0    NA        NA         NA        NA         NA        NA        
10 2016-12-12 Mon      14.5  500a      1121a      1151a     547p       616p      830p      
11 2016-12-13 Tue      10.5  548a      1120a      1148a     449p       NA        NA        
12 2016-12-14 Wed      11.6  429a      1120a      1150a     432p       NA        NA        
13 2016-12-15 Thu      11.5  429a      1121a      1152a     431p       NA        NA        
14 2016-12-16 Fri      11.6  430a      1120a      1150a     432p       NA        NA        

Note笔记

You must supply the year_observed (here 2016 ) to correctly contextualize the dates written in m/d format.您必须提供year_observed (此处为2016 )以正确地将以m/d格式编写的日期year_observed Otherwise, they will calibrate to the year 2021 , which will skew the day s of the week.否则,他们将校准到2021 ,这将歪曲一周中的day

Warning警告

These dates ( 12/3 , etc.) are in December, and close to the end of the calendar year.这些日期( 12/3等)是在 12 月,接近日历年的结束。 If any of these entries "cross over" (from 2016 ) into the next year (ex. 1/1/2017 ), they will be incorrectly calibrated to the former year (ex. 1/1/2016 ), and thus have an incorrect date and week day .如果这些条目中的任何一个“交叉”(从2016 )进入下一年(例如1/1/2017 ),它们将被错误地校准到前一年(例如1/1/2016 ),因此有一个不正确date和星期day

However, if your dates do cross over, that's a good indication that the full date ( 12/3/2016 ) should have been notated in the original cells, in which case但是,如果您的日期确实交叉,则表明完整日期 ( 12/3/2016 ) 应该已在原始单元格中注明,在这种情况下

results <- df %>%
  # ... %>%
  mutate(
    # ...
    date = as.Date(str_extract(date, "(\\d{1,2}/){2,2}\\d{4,4}$"), format = "%m/%d/%Y")
    # ...
  ) # ... %>%

would have sufficed to properly parse the date s.就足以正确解析date

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM