I am trying to split the following dataframe column into 3 columns depending on what the contents are. I tried using dplyr and mutate because I wanted to learn them better, but any suggestions would be welcome.
exampledf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),stringsAsFactors=F)
mutate(exampledf,month=strsplit(exampledf[,1],"/")[1],month=strsplit(exampledf[,1],"/")[2])
My Goal:
Year Month Country
2005 12 Argentina
2005 11 Argentina
2006 12 Bolivia
This is very close to this SO post, but it doesn't address my repeating-country issue.
We create a logical index for rows that have no numbers ('i1'), get the cumulative sum of that, split the dataset with that grouping index, extract the 'year' and 'month' with sub, take the 'Country' as the first element, create a data.frame, and rbind the list contents.
i1 <- grepl('^[^0-9]+$', exampledf$Col1)
lst <- lapply(split(exampledf, cumsum(i1)), function(x)
  data.frame(year = as.numeric(sub('\\/.*', '', x[-1,1])),
             month = as.numeric(sub('.*\\/', '', x[-1,1])),
             Country = x[1,1]))
res <- do.call(rbind, lst)
row.names(res) <- NULL
res
# year month Country
#1 2005 12 Argentina
#2 2005 11 Argentina
#3 2006 12 Bolivia
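The grouping step above can be inspected on its own. The following is only an illustrative sketch: it uses a plain character vector (named Col1 here to mirror the column name the answer assumes) to show how cumsum turns the "country row" flags into group ids that split can use.

```r
# Same values as the example column, held in a plain vector for illustration
Col1 <- c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12")

# TRUE where the entry contains no digits, i.e. a country row
i1 <- grepl("^[^0-9]+$", Col1)

# Each TRUE starts a new group, so the running total labels the blocks
grp <- cumsum(i1)

print(i1)   # TRUE FALSE FALSE TRUE FALSE
print(grp)  # 1 1 1 2 2

# split() then yields one chunk per country, with the country name first
chunks <- split(Col1, grp)
print(chunks)
```

Because every chunk starts with its country, x[1,1] in the answer picks the country and x[-1,1] picks the dates.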
Or using data.table, we convert the 'data.frame' to a 'data.table' (setDT(exampledf)). Grouped by the cumsum of the index (from above), we split (tstrsplit) 'Col1' (removing the first element) on the delimiter (/), which gives two columns. We then concatenate the first element to create three columns and change the column names with setnames. If we don't need the grouping variable, it can be assigned (:=) to NULL.
library(data.table)
res1 <- setDT(exampledf)[, c(tstrsplit(Col1[-1],
'/'),Country = Col1[1L]), .(i2=cumsum(i1))][,i2:= NULL][]
setnames(res1, 1:2, c('year', 'month'))
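To see what tstrsplit contributes in the call above, here is a small standalone sketch. The vector name dates is made up for illustration, and fixed = TRUE and type.convert = TRUE are optional extras that the answer itself does not use.

```r
library(data.table)

# Hypothetical vector holding only the date rows
dates <- c("2005/12", "2005/11", "2006/12")

# tstrsplit() splits every element and transposes the result:
# one list element per piece, each as long as the input
parts <- tstrsplit(dates, "/", fixed = TRUE)
str(parts)

# type.convert = TRUE converts the pieces to numbers where possible
parts_num <- tstrsplit(dates, "/", fixed = TRUE, type.convert = TRUE)
str(parts_num)
```

Inside the grouped data.table call, each of these list elements becomes one column, which is why two unnamed columns appear before 'Country'.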
# data used in the code above, with the column named 'Col1'
exampledf <- data.frame(Col1 = c("Argentina", "2005/12", "2005/11",
                                 "Bolivia", "2006/12"), stringsAsFactors = FALSE)
My approach is not very elegant but tries to clean the data step by step...
edf<-data.frame(c("Argentina","2005/12","2005/11","Bolivia","2006/12"),
stringsAsFactors=F)
names(edf) <- "x" # just to give a concise name
# flag if the row shows the month or not
edf$isMonth <- (regexpr("^[0-9]+/[0-9]+$", edf$x) > 0)
# expand the country
# (i.e. if the row is month, reuse the country from the previous row)
edf$country <- edf$x
for (i in seq(2, nrow(edf))) {
  if (edf$isMonth[i]) {
    edf$country[i] <- edf$country[i-1]
  }
}
# now only the rows with month are relevant
edf <- edf[edf$isMonth,]
This gets you:
x isMonth country
2005/12 TRUE Argentina
2005/11 TRUE Argentina
2006/12 TRUE Bolivia
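As an aside, the same "carry the country down" step can probably be done without a loop. The sketch below is one possible vectorized variant, not part of the original approach. It assumes the first row is always a country, and the names countries and result are illustrative.

```r
x <- c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12")
isMonth <- grepl("^[0-9]+/[0-9]+$", x)

# The k-th non-month row is the k-th country; cumsum(!isMonth) tells
# each row which country block it belongs to (assumes row 1 is a country)
countries <- x[!isMonth]
country <- countries[cumsum(!isMonth)]

# Keep only the month rows, as in the loop version
result <- data.frame(x = x, country = country,
                     stringsAsFactors = FALSE)[isMonth, ]
print(result)
```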
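Now, the remaining task is to split your year-month variable into year and month. In your example code strsplit fails because it returns a list with one element per row, so [1] and [2] select the splits of the first and second rows rather than the first and second pieces of every row. mutate works on whole columns and expects a vector with one value per row; it does not apply the indexing element-wise.
In this particular case I find stringr::str_match to be useful.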
library(stringr)
matched <- str_match(edf$x, "([0-9]+)/([0-9]+)")
edf$year <- matched[, 2]
edf$month <- matched[, 3]
The result is:
x isMonth country year month
2005/12 TRUE Argentina 2005 12
2005/11 TRUE Argentina 2005 11
2006/12 TRUE Bolivia 2006 12
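For comparison, the strsplit indexing problem described above can be reproduced in base R. This is only a sketch on a hypothetical vector of date strings; vapply is one of several ways to pull out a given piece from every row.

```r
x <- c("2005/12", "2005/11", "2006/12")

# One list element per input string, each holding that string's pieces
pieces <- strsplit(x, "/", fixed = TRUE)

# [1] keeps a list of length 1 holding the split of the FIRST row only
first_row <- pieces[1]
print(first_row)

# Element-wise extraction: take piece 1 (or 2) from every row
year  <- vapply(pieces, function(p) p[1], character(1))
month <- vapply(pieces, function(p) p[2], character(1))
print(data.frame(year, month))
```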
An alternative strategy. It's not concise, but it's easy to follow.
library(tidyr)
df <-data.frame(Country = c("Argentina","2005/12","2005/11","Bolivia","2006/12"),stringsAsFactors=F)
df$dates[grep("[0-9]",df$Country)] <- df$Country[grep("[0-9]",df$Country)]
df$Country[grep("[0-9]",df$Country)] <- NA
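x <- df$Country                # working copy of the Country column
replace_with <- NA_character_
for (i in seq_along(x)) {
  if (!is.na(x[i])) {
    replace_with <- x[i]       # remember the most recent country
  } else {
    x[i] <- replace_with       # carry it down into the date row
  }
}
df$Country <- x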
df <- separate(df, dates, c("Year", "Month"), "/")
df <- na.omit(df)
df
Country Year Month
2 Argentina 2005 12
3 Argentina 2005 11
5 Bolivia 2006 12
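Since the question mentions dplyr, the same idea can be sketched as a single pipeline. This is one possible arrangement, not a definitive solution: it assumes the column is named Col1 (as in the first answer) and relies on tidyr's fill and separate.

```r
library(dplyr)
library(tidyr)

exampledf <- data.frame(
  Col1 = c("Argentina", "2005/12", "2005/11", "Bolivia", "2006/12"),
  stringsAsFactors = FALSE
)

res <- exampledf %>%
  # country rows keep their value, date rows become NA
  mutate(Country = ifelse(grepl("[0-9]", Col1), NA_character_, Col1)) %>%
  # carry the last seen country down into the date rows
  fill(Country) %>%
  # drop the country-only rows
  filter(grepl("/", Col1, fixed = TRUE)) %>%
  # split "2005/12" into two columns and convert them to integers
  separate(Col1, into = c("Year", "Month"), sep = "/", convert = TRUE)

print(res)
```

The columns come out in the order Year, Month, Country, matching the layout asked for in the question.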
Here's another option. You can use read.mtable from my "SOfun" package along with cSplit from "splitstackshape" and rbindlist from "data.table".
Assuming you have loaded at least the read.mtable function (in case you don't want to install the package), the approach would be:
library(SOfun)
library(splitstackshape)
rbindlist(lapply(read.mtable(textConnection(exampledf[[1]]), "[a-z]"),
cSplit, "V1", "/"), idcol = TRUE)
# .id V1_1 V1_2
# 1: Argentina 2005 12
# 2: Argentina 2005 11
# 3: Bolivia 2006 12
Alternatively, you can split the data with read.mtable itself (though I suspect that cSplit might be faster). Thus, the approach would be:
# library(SOfun)
# library(data.table)
rbindlist(read.mtable(textConnection(exampledf[[1]]), "[a-z]",
sep = "/", col.names = c("Year", "Month")), idcol = TRUE)
# .id Year Month
# 1: Argentina 2005 12
# 2: Argentina 2005 11
# 3: Bolivia 2006 12
With that approach, you have the added advantage of naming the columns in the process.