I have a data set with several kinds of variables. One variable contains a description that includes a year and a month. I want to extract the month and year from that variable, but I have been unable to do so.
Sample_Data
var1 var2
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206 2008 MARCH MONTH BROKERAGE
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL
204 BROKERAGE FOR THE MONTH OF MARCH 2008
Expected_output:
var1 var2 month year
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
206 2008 MARCH MONTH BROKERAGE MARCH 2008
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL APRIL 2009
204 BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
Tried:
library(lubridate)
Sample_Data$month = month(Sample_Data$var2)
Sample_Data$year = year(Sample_Data$var2)
I have tried different approaches, such as lubridate and POSIXlt, but could not find a solution. Please help.
We can use extract from tidyr, specifying a regex that matches the characters as shown in the input dataset.
library(tidyr)
extract(df1, var2, into=c('month', 'year'), '.*\\s+([A-Z]+)\\s+(\\d+)$',
remove=FALSE, convert=TRUE)
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
Or, using base R, we remove the substring at the beginning of each string in 'var2' and keep only the captured group: a word ( \\w+ ) followed by whitespace ( \\s+ ) followed by digits ( \\d+ ) up to the end of the string. In the replacement, we refer to that capture group ( \\1 ). We then read the result with read.table to create the new columns in 'df1'.
df1[c('month', 'year')] <- read.table(text=sub('.*(\\b\\w+\\s+\\d+)$',
'\\1', df1$var2), stringsAsFactors=FALSE)
df1
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
NOTE: Both methods convert the new columns to their respective classes (via convert=TRUE in extract, and via read.table's automatic type conversion).
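As a quick check of the class conversion mentioned in the note, here is a minimal sketch that feeds two hand-written "MONTH YEAR" strings to read.table. The names res and classes are only illustrative and are not part of the answer above.

```r
# Two already-trimmed strings, as the sub() call would produce them
res <- read.table(text = c("MARCH 2008", "MAY 2010"),
                  col.names = c("month", "year"),
                  stringsAsFactors = FALSE)

# read.table guesses a type per column: text stays character, digits become integer
classes <- sapply(res, class)
classes
#       month        year
# "character"   "integer"
```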
df1 <- structure(list(var1 = c(203L, 205L, 206L, 207L, 204L),
var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
"UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
"UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008"
)), .Names = c("var1", "var2"), class = "data.frame",
row.names = c(NA, -5L))
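The data above has the month and year as the last two words in every row, whereas rows 206 and 207 in the question place the year elsewhere. A possible position-independent sketch in base R is shown below. It assumes full English month names in upper case and four-digit years starting with 19 or 20; df2, month_pattern and extract_first are hypothetical names introduced only for this illustration.

```r
# The question's original rows, where month and year appear in varying positions
df2 <- data.frame(
  var1 = c(203L, 205L, 206L, 207L, 204L),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "2008 MARCH MONTH BROKERAGE",
           "UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL",
           "BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Alternation of upper-case month names: \b(JANUARY|FEBRUARY|...)\b
month_pattern <- paste0("\\b(", paste(toupper(month.name), collapse = "|"), ")\\b")

# Return the first match of `pattern` in each element of `x`, or NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

df2$month <- extract_first(df2$var2, month_pattern)
# Assumption: a year is any standalone 4-digit number beginning with 19 or 20
df2$year <- as.integer(extract_first(df2$var2, "\\b(19|20)\\d{2}\\b"))
df2
```

Because the word "MONTH" is not itself a month name, it is not picked up by the pattern; rows with no month or no year would get NA.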
You can't quite treat it as a date yet, because you first need to parse the string. Try t(sapply(strsplit(Sample_Data$var2," "),function(x) x[7:8])) to get the two columns that you want. Note that this assumes the month and year are always the 7th and 8th words.
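If the descriptions do not all have eight words, a variant of the same idea is to take the last two words with tail instead of fixed positions. This is only a sketch: var2 and parts are illustrative names, and it still assumes the month and year are the final two words.

```r
var2 <- c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
          "BROKERAGE FOR THE MONTH OF MARCH 2008")  # 8 words and 7 words

# Split on single spaces, keep the last two words of each string,
# then transpose so that each input string becomes one row
parts <- t(sapply(strsplit(var2, " ", fixed = TRUE), function(x) tail(x, 2)))

parts[, 1]  # month names
parts[, 2]  # years, still as character strings
```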
You don't need lubridate because you are not really working with the Date data type. Use strsplit in base R to split your var2 into "words". It looks like the month is always the next-to-last word, and the year is the last word.
# reproducible example please!
d <- read.table(textConnection("
var1, var2
203, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205, UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
207, UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009
204, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
"), header=TRUE, sep=",", stringsAsFactors=FALSE)
get_month <- function(s) {
words <- unlist(strsplit(s, " "))
words[length(words)-1]
}
get_year <- function(s) {
words <- unlist(strsplit(s, " "))
as.integer(words[length(words)])
}
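# sapply returns plain vectors; lapply would make 'year' a list column
d$month = sapply(d$var2, get_month, USE.NAMES=FALSE)
d$year = sapply(d$var2, get_year, USE.NAMES=FALSE)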
d
produces the desired output
> d
var1 var2 month year
1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008