
How to extract month and year from a variable with description in R?

I have a data set with several kinds of variables. One variable contains a description that includes a year and a month. I want to extract the month and year from that variable, but I have not been able to.

Sample_Data

var1   var2 
203    UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008                           
205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010                           
206    2008 MARCH MONTH BROKERAGE                            
207    UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL                           
204    BROKERAGE FOR THE MONTH OF MARCH 2008                           


Expected_output:

var1   var2                                            month   year     
203    UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008   MARCH   2008                      
205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010     MAY     2010                           
206    2008 MARCH MONTH BROKERAGE                      MARCH   2008                      
207    UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL       APRIL   2009                           
204    BROKERAGE FOR THE MONTH OF MARCH 2008           MARCH   2008

Tried:
library(lubridate)
Sample_Data$month = month(Sample_Data$var2)
Sample_Data$year = year(Sample_Data$var2)

I have tried several approaches, including lubridate and POSIXlt, but could not find a solution. Please help.

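We can use extract from tidyr, specifying a regex that matches the characters as shown in the input dataset. The pattern assumes that the month and year are the last two words of the string, as in the df1 defined under "data" below.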

library(tidyr)
extract(df1, var2, into=c('month', 'year'), '.*\\s+([A-Z]+)\\s+(\\d+)$', 
             remove=FALSE, convert=TRUE)
#  var1                                          var2 month year
#1  203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2  205   UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
#3  206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4  207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5  204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008

Or, using base R, we strip everything before the final word and number in 'var2'. The pattern captures a word ( \\w+ ) followed by whitespace ( \\s+ ) followed by digits ( \\d+ ) up to the end of the string, and the replacement keeps only that capture group ( \\1 ). We then read the result with read.table to create the new columns in 'df1'.

df1[c('month', 'year')] <-  read.table(text=sub('.*(\\b\\w+\\s+\\d+)$',
                                   '\\1', df1$var2), stringsAsFactors=FALSE)
df1
#  var1                                          var2 month year
#1  203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2  205   UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
#3  206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4  207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5  204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008

NOTE: Both methods convert the new columns to an appropriate class: month becomes character and year becomes integer.

data

df1 <- structure(list(var1 = c(203L, 205L, 206L, 207L, 204L),
var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008", 
"UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008", 
"UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008"
)), .Names = c("var1", "var2"), class = "data.frame", 
row.names = c(NA, -5L))
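
Both patterns above assume that the month and year are the last two words. Two rows in the question (206 and 207) put the year before the month, so here is a minimal sketch that searches for each piece independently. The data frame name qdf, the helper first_match, and the rule that a year is a standalone four-digit number starting with 19 or 20 are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical data frame reproducing the question's sample (the name `qdf` is assumed).
qdf <- data.frame(
  var1 = c(203L, 205L, 206L, 207L, 204L),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "2008 MARCH MONTH BROKERAGE",
           "UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL",
           "BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Alternation of the twelve upper-case English month names, matched as whole words.
month_pat <- paste0("\\b(", paste(toupper(month.name), collapse = "|"), ")\\b")
# Assumption: a year is a standalone four-digit number starting with 19 or 20.
year_pat <- "\\b(19|20)[0-9]{2}\\b"

# Hypothetical helper: first match of `pattern` in each string, NA where none is found.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

qdf$month <- first_match(qdf$var2, month_pat)
qdf$year  <- as.integer(first_match(qdf$var2, year_pat))
qdf
```

Because the month and the year are matched separately, their order within the description does not matter. Rows with no recognisable month or year get NA rather than an error.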

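You can't treat it as a date yet, because you first need to parse the string. Try t(sapply(strsplit(Sample_Data$var2, " "), function(x) tail(x, 2))) to get the two columns that you want. tail(x, 2) takes the last two words, whereas fixed positions such as x[7:8] only work when every description has exactly eight words.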
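
A minimal sketch of this split-and-pick idea, assuming the month and year are always the last two words. It takes the last two words rather than fixed positions so that shorter descriptions also work, and it then shows one possible way to turn the result into a Date. The variable names x, parts and res are illustrative only.

```r
# Two descriptions from the question: one with eight words, one with seven.
x <- c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
       "BROKERAGE FOR THE MONTH OF MARCH 2008")

# Split on whitespace and keep the last two words of each string.
# vapply() raises an error if a string has fewer than two words.
parts <- t(vapply(strsplit(trimws(x), "\\s+"),
                  function(w) tail(w, 2),
                  character(2)))

res <- data.frame(month = parts[, 1],
                  year  = as.integer(parts[, 2]),
                  stringsAsFactors = FALSE)

# Optional: build a Date for the first day of that month.
# match() against toupper(month.name) avoids locale-dependent month parsing.
res$date <- as.Date(sprintf("%d-%02d-01", res$year,
                            match(res$month, toupper(month.name))))
res
```

Once the date column exists, lubridate functions such as month() and year() can be applied to it, which is what the original attempt could not do on the raw description.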

You don't need lubridate because you are not really working with the Date data type. Use strsplit in base to split your var2 into "words". It looks like month is always the next-to-last word, and year is the last word.

# reproducible example please!
d <- read.table(textConnection("
var1, var2
203, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205, UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
207, UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009
204, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
"), header=TRUE, sep=",", stringsAsFactors=FALSE)

get_month <- function(s) {
  words <- unlist(strsplit(s, " "))
  words[length(words)-1]
}
get_year <- function(s) {
  words <- unlist(strsplit(s, " "))
  as.integer(words[length(words)])
}

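# sapply returns a plain vector for each new column
# (lapply would store year as a list column)
d$month = sapply(d$var2, get_month, USE.NAMES = FALSE)
d$year = sapply(d$var2, get_year, USE.NAMES = FALSE)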

d

This produces the desired output:

> d
  var1                                           var2 month year
1  203  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
2  205    UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010   MAY 2010
3  206  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
4  207  UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
5  204  UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008


 