Take messy dataframe with rows interspersed (i.e. a row bookending each chunk of observations) and move bookend row into a new column as identifier?

I have a messy dataframe in which each chunk of observations (rows) is followed by a bookend row that identifies that chunk. This pattern repeats throughout the dataframe.

A sample of the input dataframe I am working with:

val<-c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
       "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
       "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
       "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
type<-c("Settle", "Settle", "Settle",NA,"Settle", "Settle", "Settle",NA)
time_sec<-c(0.90,0.80,0.70,"10/22",0.90,0.80,0.70,"10/22")
sec<-c("s","s","s",NA,"s","s","s",NA)
press_psi<-c(10.0073,10.0070, 10.0067,NA,10.0073,10.0070, 10.0067,NA)

df<-data.frame(val,type,time_sec,press_psi)

I want to take that last row from each chunk of observations and use it as an identifier housed in a new column so it looks like this:

val<-c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067", 
       "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067")
type<-c("Settle", "Settle", "Settle","Settle", "Settle", "Settle")
time_sec<-c(0.90,0.80,0.70,0.90,0.80,0.70)
sec<-c("s","s","s","s","s","s")
press_psi<-c(10.0073,10.0070, 10.0067,10.0073,10.0070, 10.0067)
part<-c("10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086",
        "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
df<-data.frame(val,type,time_sec,press_psi,part)

Because the number of observations per 'part' section will vary over time, I was hoping for an approach, perhaps using grepl to match 'PD' in the messy bookend rows, that works regardless of how many observations or 'part' sections there are. The following worked, but it would require more work to give each of the separate dataframes in the output list a new 'part' identifier column and then rbind() them back together.

lapply(split(df, cumsum(grepl("PD", df$val))), tail, -1)

But surely there is another way!

The messy dataframe is due to an unchangeable/non-reprogrammable piece of equipment, hence having to deal with the messiness. As you can see I already had to extract the various pieces of data from the 'val' column, which stores data as a messy string.

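You can create a new column (part) that holds the val value for rows containing 'PD' and NA for all other rows, then fill the NAs upward with tidyr::fill so that each observation picks up the bookend row that follows it. Finally, use filter to remove the rows that contain 'PD'.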

library(dplyr)
library(tidyr)

df <- df %>%
  mutate(part = replace(val, !grepl("PD", val), NA)) %>%
  fill(part, .direction = 'up') %>%
  filter(!grepl("PD", val))

df

#                 val   type time_sec press_psi                                                     part
#1 Settle0.90s10.0073 Settle      0.9   10.0073 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#2 Settle0.80s10.0070 Settle      0.8   10.0070 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#3 Settle0.70s10.0067 Settle      0.7   10.0067 10/22/2110:51:4100PROG 0        PD    Pass        0.0150
#4 Settle0.90s10.0073 Settle      0.9   10.0073 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
#5 Settle0.80s10.0070 Settle      0.8   10.0070 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
#6 Settle0.70s10.0067 Settle      0.7   10.0067 10/22/2110:52:0300PROG 0        PD    Pass        0.0086
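
If you would rather avoid the extra packages, the same idea can be sketched in base R. This is only a sketch under the same assumptions as above: 'PD' appears only in the bookend rows, and each bookend row comes after its chunk. The names messy, is_bookend, chunk and tidy are made up for illustration.

```r
# Sample input mirroring the val column from the question
val <- c("Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:51:4100PROG 0        PD    Pass        0.0150",
         "Settle0.90s10.0073", "Settle0.80s10.0070", "Settle0.70s10.0067",
         "10/22/2110:52:0300PROG 0        PD    Pass        0.0086")
messy <- data.frame(val, stringsAsFactors = FALSE)

# TRUE for the bookend rows (assumes 'PD' never occurs in an observation row)
is_bookend <- grepl("PD", messy$val, fixed = TRUE)

# Chunk number for every row: count the bookends strictly before the row,
# so a bookend gets the same chunk number as the observations it closes
chunk <- cumsum(is_bookend) - is_bookend + 1

# Look up the bookend string that belongs to each row's chunk;
# rows after the last bookend (if any) would get NA
messy$part <- messy$val[is_bookend][chunk]

# Drop the bookend rows and reset the row names
tidy <- messy[!is_bookend, , drop = FALSE]
rownames(tidy) <- NULL
tidy
```

Because chunk is derived from the positions of the bookend rows, this should not depend on how many observations each chunk has or how many chunks there are.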
