
Splitting contents of a dataframe column into different columns based on values

I am trying to split the following dataframe column into 3 columns depending on what the contents are. I tried using dplyr and mutate because I wanted to learn them better, but any suggestions would be welcome.

exampledf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),stringsAsFactors=F)
mutate(exampledf,month=strsplit(exampledf[,1],"/")[1],month=strsplit(exampledf[,1],"/")[2])

My Goal:

Year     Month    Country
2005     12       Argentina
2005     11       Argentina
2006     12       Bolivia

This is very close to this SO post, but it doesn't address my repeating-country issue.

We create a logical index ('i1') for the rows that contain no digits, take its cumulative sum to get a grouping index, and split the dataset by that index. Within each group, the first element is the 'Country'; we extract the 'year' and 'month' from the remaining rows with sub, build a data.frame, and finally rbind the list contents.

 i1 <- grepl('^[^0-9]+$', exampledf$Col1)
 lst <- lapply(split(exampledf, cumsum(i1)), function(x) 
   data.frame(year= as.numeric(sub('\\/.*', '',   x[-1,1])), 
              month = as.numeric(sub('.*\\/', '', x[-1,1])),
              Country = x[1,1] ) )
 res <- do.call(rbind, lst)
 row.names(res) <- NULL

 res
 # year month   Country
 #1 2005    12 Argentina
 #2 2005    11 Argentina
 #3 2006    12   Bolivia
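
To see why the cumulative sum works as a grouping key, it can help to print the intermediate vectors. The following is only a small sketch in base R; it rebuilds the example data with the column name Col1 so that it runs on its own, and the name grp is introduced here purely for illustration.

```r
# Example data, with the single column named Col1
exampledf <- data.frame(Col1 = c("Argentina", "2005/12", "2005/11",
                                 "Bolivia", "2006/12"),
                        stringsAsFactors = FALSE)

# TRUE on rows that contain no digits, i.e. the country rows
i1 <- grepl('^[^0-9]+$', exampledf$Col1)

# The running total goes up by one at each country row, so every
# date row inherits the group number of the country above it
grp <- cumsum(i1)

i1
# [1]  TRUE FALSE FALSE  TRUE FALSE
grp
# [1] 1 1 1 2 2
```

Splitting on grp therefore yields one chunk per country, with the country name always in the first row of the chunk.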

Or, using data.table: we convert the 'data.frame' to a 'data.table' ( setDT(exampledf) ) and, grouped by the cumsum of the index from above, split 'Col1' (minus its first element) on the delimiter ( / ) with tstrsplit , which yields two columns. We then append the first element as 'Country' to get three columns, and rename the first two with setnames . The grouping variable is not needed afterwards, so it is assigned ( := ) to NULL.

library(data.table)
res1 <- setDT(exampledf)[, c(tstrsplit(Col1[-1], 
        '/'),Country = Col1[1L]), .(i2=cumsum(i1))][,i2:= NULL][]
setnames(res1, 1:2, c('year', 'month'))

data

 exampledf<-data.frame(Col1=c("Argentina","2005/12","2005/11",
          "Bolivia","2006/12"),stringsAsFactors=FALSE)

My approach is not very elegant but tries to clean the data step by step...

edf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),
                stringsAsFactors=F)

names(edf) <- "x"  # just to give a concise name

# flag if the row shows the month or not
edf$isMonth <- (regexpr("^[0-9]+/[0-9]+$", edf$x) > 0)

# expand the country 
# (i.e. if the row is month, reuse the country from the previous row)
edf$country <- edf$x
for (i in seq(2, nrow(edf))) {
  if (edf$isMonth[i]) {
    edf$country[i] <- edf$country[i-1]
  }
}

# now only the rows with month are relevant
edf <- edf[edf$isMonth,]

This gets you:

      x isMonth   country
2005/12    TRUE Argentina
2005/11    TRUE Argentina
2006/12    TRUE   Bolivia

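Now the remaining task is to split the year-month variable into year and month. In your example code, strsplit does not do what you want because it returns a list with one element per row, so strsplit(...)[1] is a one-element list holding the pieces of the first row, not the first piece of every row. mutate works on whole columns and expects one value per row, so the pieces have to be extracted element-wise.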
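
As a rough illustration of that point, the sketch below shows what strsplit returns and one base R way to pull out the pieces element-wise with vapply. The variable names (x, parts, year, month) are chosen here for the example only.

```r
x <- c("2005/12", "2005/11", "2006/12")

# strsplit returns a list with one character vector per input element
parts <- strsplit(x, "/")

# Indexing with [1] keeps a list of length 1: the pieces of the FIRST
# row only, not the first piece of every row
first_only <- parts[1]

# Element-wise extraction: take piece 1 (or 2) from every list element
year  <- vapply(parts, function(p) p[1], character(1))
month <- vapply(parts, function(p) p[2], character(1))

year
# [1] "2005" "2005" "2006"
month
# [1] "12" "11" "12"
```

Vectors built this way have one value per row, which is the shape mutate needs for a new column.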

In this particular case I find stringr::str_match to be useful.

library(stringr)
matched <- str_match(edf$x, "([0-9]+)/([0-9]+)")
edf$year <- matched[, 2]
edf$month <- matched[, 3]

The result is:

      x isMonth   country year month
2005/12    TRUE Argentina 2005    12
2005/11    TRUE Argentina 2005    11
2006/12    TRUE   Bolivia 2006    12
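
If stringr is not available, the same capture groups can probably be obtained with base R's regexec and regmatches. This is a sketch rather than a drop-in replacement; the as.integer conversion is an addition made here on the assumption that numeric year and month columns are wanted, as in the goal table.

```r
x <- c("2005/12", "2005/11", "2006/12")

# For each string: the full match, then capture group 1 and group 2
m <- regmatches(x, regexec("([0-9]+)/([0-9]+)", x))

# Stack into a character matrix with the same column layout as str_match
matched <- do.call(rbind, m)

# Column 2 holds the year, column 3 the month; convert from character
year  <- as.integer(matched[, 2])
month <- as.integer(matched[, 3])
```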

An alternative strategy. It's not concise, but it's easy to follow.

library(tidyr)
df <- data.frame(Country = c("Argentina","2005/12","2005/11","Bolivia","2006/12"), stringsAsFactors = FALSE)

# move the year/month values into their own column, leaving NA in Country
isDate <- grepl("[0-9]", df$Country)
df$dates <- NA_character_
df$dates[isDate] <- df$Country[isDate]
df$Country[isDate] <- NA

# carry the last seen country name forward over the NA rows
x <- df$Country
replace_with <- NA_character_
for (i in seq_along(x)) {
  if (!is.na(x[i])) {
    replace_with <- x[i]
  } else {
    x[i] <- replace_with
  }
}
df$Country <- x
df <- separate(df, dates, c("Year", "Month"), "/")
df <- na.omit(df)
df
    Country Year Month
2 Argentina 2005    12
3 Argentina 2005    11
5   Bolivia 2006    12
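
The carry-forward loop can also be written without an explicit loop. The following base R sketch uses cummax to find, for each position, the index of the most recent non-NA entry. It assumes the first element is not NA (an index of 0 would silently drop that position), and the names idx and filled are made up for this example.

```r
x <- c("Argentina", NA, NA, "Bolivia", NA)

# Positions of non-NA entries keep their index, NA positions become 0;
# cummax then carries the last non-zero index forward
idx <- cummax(seq_along(x) * (!is.na(x)))

# Look up the country name at the carried-forward index
filled <- x[idx]
```

Here idx is 1 1 1 4 4, so every NA is replaced by the country name that precedes it.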

Here's another option. You can use read.mtable from my "SOfun" package along with cSplit from "splitstackshape" and rbindlist from "data.table".

Assuming you have loaded at least the read.mtable function (in case you don't want to install the package) the approach would be:

library(SOfun)
library(splitstackshape)

rbindlist(lapply(read.mtable(textConnection(exampledf[[1]]), "[a-z]"), 
                 cSplit, "V1", "/"), idcol = TRUE)
#          .id V1_1 V1_2
# 1: Argentina 2005   12
# 2: Argentina 2005   11
# 3:   Bolivia 2006   12

Alternatively, you can split the data with read.mtable itself (though I suspect that cSplit might be faster). Thus, the approach would be:

# library(SOfun)
# library(data.table)
rbindlist(read.mtable(textConnection(exampledf[[1]]), "[a-z]", 
                      sep = "/", col.names = c("Year", "Month")), idcol = TRUE)
#          .id Year Month
# 1: Argentina 2005    12
# 2: Argentina 2005    11
# 3:   Bolivia 2006    12

With that approach, you have the added advantage of naming the columns in the process.
