How to extract month and year from a variable with description in R?
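I have a dataset with several types of variables. One variable holds a description that includes a year and a month. I want to extract the month and year from that variable, but I have not been able to.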
Sample_Data
var1 var2
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206 2008 MARCH MONTH BROKERAGE
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL
204 BROKERAGE FOR THE MONTH OF MARCH 2008
Expected_output:
var1 var2 month year
203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
206 2008 MARCH MONTH BROKERAGE MARCH 2008
207 UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL APRIL 2009
204 BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
Tried:
library(lubridate)
Sample_Data$month = month(Sample_Data$var2)
Sample_Data$year = year(Sample_Data$var2)
I tried different approaches, such as lubridate and POSIXlt, but could not find a solution. Please help me with this.
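We can use extract from tidyr, specifying a regular expression that matches the pattern shown in the input dataset: a run of uppercase letters (the month) followed by digits (the year) at the end of the string.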
library(tidyr)
extract(df1, var2, into=c('month', 'year'), '.*\\s+([A-Z]+)\\s+(\\d+)$',
remove=FALSE, convert=TRUE)
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
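Or, using base R, we use sub to strip everything from the start of the string in 'var2', capturing a word ( \\w+ ) followed by whitespace ( \\s+ ) followed by digits ( \\d+ ) up to the end of the string. In the replacement we specify the capture group ( \\1 ). We then read the result with read.table to create the new columns in 'df1'.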
df1[c('month', 'year')] <- read.table(text=sub('.*(\\b\\w+\\s+\\d+)$',
'\\1', df1$var2), stringsAsFactors=FALSE)
df1
# var1 var2 month year
#1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
#3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
#4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
#5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
Note: in both methods the new columns are converted to their respective classes (character for month, integer for year).
Data:
df1 <- structure(list(var1 = c(203L, 205L, 206L, 207L, 204L),
var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
"UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
"UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009",
"UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008"
)), .Names = c("var1", "var2"), class = "data.frame",
row.names = c(NA, -5L))
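Both patterns above anchor on the end of the string, so they rely on the month and year being the last two words. The sample posted in the question also contains rows where the year comes before the month ("2008 MARCH MONTH BROKERAGE"). A minimal base R sketch for that case, assuming month names are always spelled out in full upper case and years are four digits starting with 19 or 20; the helper first_match is a hypothetical name introduced here for illustration:

```r
# Sample data as posted in the question, including rows where the
# year precedes the month.
Sample_Data <- data.frame(
  var1 = c(203L, 205L, 206L, 207L, 204L),
  var2 = c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
           "UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010",
           "2008 MARCH MONTH BROKERAGE",
           "UPFRONT BROKERAGE FOR 2009 MONTH OF APRIL",
           "BROKERAGE FOR THE MONTH OF MARCH 2008"),
  stringsAsFactors = FALSE
)

# Alternation of the twelve upper-case month names, bounded by word edges.
month_pattern <- paste0("\\b(", paste(toupper(month.name), collapse = "|"), ")\\b")

# Assumption: a year is a standalone four-digit number starting with 19 or 20.
year_pattern <- "\\b(19|20)\\d{2}\\b"

# Hypothetical helper: first match of `pattern` in each string, NA if none.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- m != -1L
  out[hit] <- regmatches(x, m)
  out
}

Sample_Data$month <- first_match(Sample_Data$var2, month_pattern)
Sample_Data$year  <- as.integer(first_match(Sample_Data$var2, year_pattern))
Sample_Data
```

Searching for the month and the year separately makes the result independent of word order, at the cost of returning only the first match if a description mentions more than one month or year.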
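You can't treat it as a date yet, because you first need to parse the string. Try t(sapply(strsplit(Sample_Data$var2," "),function(x) x[7:8])) to get the two columns you want. Note that this assumes the month and year are always the 7th and 8th words.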
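The fixed positions 7:8 only fit descriptions that have exactly eight words, and the question's sample includes a seven-word row ("BROKERAGE FOR THE MONTH OF MARCH 2008"). A small sketch of the same idea that takes the last two words instead, still assuming the month and year come last; the variable names are illustrative:

```r
var2 <- c("UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008",
          "BROKERAGE FOR THE MONTH OF MARCH 2008")

# Split each description on single spaces.
words <- strsplit(var2, " ", fixed = TRUE)

# tail(x, n = 2) keeps the last two words regardless of how many precede them;
# sapply returns them column-wise, so transpose to get one row per description.
last_two <- t(sapply(words, tail, n = 2))
last_two
```

Column 1 of last_two holds the month and column 2 the year, both as character strings.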
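You don't need lubridate, because you are not really working with a Date data type. Use strsplit from base to split your var2 into "words". It looks like the month is always the second-to-last word and the year is the last word.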
# reproducible example please!
d <- read.table(textConnection("
var1, var2
203, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
205, UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010
206, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
207, UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009
204, UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008
"), header=TRUE, sep=",", stringsAsFactors=FALSE, strip.white=TRUE)
get_month <- function(s) {
words <- unlist(strsplit(s, " "))
words[length(words)-1]
}
get_year <- function(s) {
words <- unlist(strsplit(s, " "))
as.integer(words[length(words)])
}
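# vapply returns plain vectors; lapply would make year a list column
d$month <- vapply(d$var2, get_month, character(1), USE.NAMES=FALSE)
d$year <- vapply(d$var2, get_year, integer(1), USE.NAMES=FALSE)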
d
which produces the desired output
> d
var1 var2 month year
1 203 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
2 205 UPFRONT BROKERAGE FOR THE MONTH OF MAY 2010 MAY 2010
3 206 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
4 207 UPFRONT BROKERAGE FOR THE MONTH OF APRIL 2009 APRIL 2009
5 204 UPFRONT BROKERAGE FOR THE MONTH OF MARCH 2008 MARCH 2008
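If the extracted pieces are later needed as real dates, for example to sort or to use lubridate's month() and year(), the month name can be mapped to a number with base R's month.name constant. This is a sketch only, assuming full upper-case English month names and using the first day of each month as a placeholder day; the vectors stand in for the month and year columns produced above:

```r
# Stand-ins for the extracted columns.
month <- c("MARCH", "MAY", "MARCH", "APRIL", "MARCH")
year  <- c(2008L, 2010L, 2008L, 2009L, 2008L)

# Position of each name in the built-in month.name vector gives 1..12.
month_num <- match(month, toupper(month.name))

# Build ISO-formatted strings and parse them; day 01 is an arbitrary choice.
first_day <- as.Date(sprintf("%d-%02d-01", year, month_num))
first_day
```

Matching against month.name avoids locale-dependent parsing of month names with format codes such as %B.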