Categorize month `factor` to time periods in data.frame
Thanks to @akrun and @ulfelder I realized my initial example wasn't complex enough, as I only had one year. Data covering several years might be more realistic, and more useful for others.
Say instead my data was,
df <- structure(list(yr_month = structure(1:7, .Label = c("2014-1", "2014-2",
"2014-3", "2015-4", "2016-4", "2016-6", "2017-7"), class = "factor"),
a = c(4.14, 2.83, 3.71, 4.15, 4.63, 4.91, 5.31), b = c(4.25,
3.5, 3.5, 3.5, 3.5, 3.5, 5)), .Names = c("yrQ", "a", "b"
), row.names = c(NA, 7L), class = "data.frame")
df
# yrQ a b
# 1 2014-1 4.14 4.25
# 2 2014-2 2.83 3.50
# 3 2014-3 3.71 3.50
# 4 2015-4 4.15 3.50
# 5 2016-4 4.63 3.50
# 6 2016-6 4.91 3.50
# 7 2017-7 5.31 5.00
and I wanted to create a category covering before Mar 2014 (2014-3), between 2014-3 and 2016-4, and after 2016-4, so that I got something like this,
# yr.cat yrQ a b
# 1 "A" 2014-1 4.14 4.25
# 2 "A" 2014-2 2.83 3.50
# 3 "B" 2014-3 3.71 3.50
# 4 "B" 2015-4 4.15 3.50
# 5 "B" 2016-4 4.63 3.50
# 6 "C" 2016-6 4.91 3.50
# 7 "C" 2017-7 5.31 5.00
Say I have a data set like this,
df <- structure(list(yr_month = structure(1:7, .Label = c("2016-1", "2016-2",
"2016-3", "2016-4", "2016-5", "2016-6", "2016-7"), class = "factor"),
a = c(4.14, 2.83, 3.71, 4.15, 4.63, 4.91, 5.31), b = c(4.25,
3.5, 3.5, 3.5, 3.5, 3.5, 5)), .Names = c("yrQ", "a", "b"
), row.names = c(NA, 7L), class = "data.frame")
df
# yrQ a b
# 1 2016-1 4.14 4.25
# 2 2016-2 2.83 3.50
# 3 2016-3 3.71 3.50
# 4 2016-4 4.15 3.50
# 5 2016-5 4.63 3.50
# 6 2016-6 4.91 3.50
# 7 2016-7 5.31 5.00
Now, I can use ifelse() to categorize a numeric variable. Like this,
df$a.cat <- ifelse(df$a < 3.8, c("tiny"), ifelse(df$a < 4.8, c("medium"), c("huge")) )
df
# yrQ a b a.cat
# 1 2016-1 4.14 4.25 medium
# 2 2016-2 2.83 3.50 tiny
# 3 2016-3 3.71 3.50 tiny
# 4 2016-4 4.15 3.50 medium
# 5 2016-5 4.63 3.50 medium
# 6 2016-6 4.91 3.50 huge
# 7 2016-7 5.31 5.00 huge
But what if I want to create a variable signifying some time periods? Say before Mar 2016 (2016-3), between 2016-3 and 2016-5, and after 2016-5. I realize I can transform the data to ts and then use window() to cut it up and then put it back together, but isn't there a smarter way to get to something like this using if else on yrQ?
It's something like this I want to get to,
yr.cat yrQ a b
1 "A" 2016-1 4.14 4.25
2 "A" 2016-2 2.83 3.50
3 "B" 2016-3 3.71 3.50
4 "B" 2016-4 4.15 3.50
5 "B" 2016-5 4.63 3.50
6 "C" 2016-6 4.91 3.50
7 "C" 2016-7 5.31 5.00
We can use cut after extracting the month substring from 'yrQ':
df$yr.cat <- cut(as.numeric(sub(".*-", "", df$yrQ)),
breaks = c(-Inf,2, 5, Inf), labels = LETTERS[1:3])
df$yr.cat
#[1] A A B B B C C
#Levels: A B C
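The breaks above rely on cut()'s default right = TRUE, which makes each interval closed on the right: (-Inf, 2], (2, 5], (5, Inf). A small sketch on the bare month numbers 1 to 7 (the vector `months` and the result names are only illustrative) shows where each boundary month lands, and what would change with right = FALSE:

```r
# Month numbers 1..7, standing in for the months extracted from yrQ
months <- 1:7

# Default right = TRUE: intervals are (-Inf, 2], (2, 5], (5, Inf)
right_closed <- cut(months, breaks = c(-Inf, 2, 5, Inf), labels = LETTERS[1:3])

# right = FALSE: intervals are [-Inf, 2), [2, 5), [5, Inf)
left_closed <- cut(months, breaks = c(-Inf, 2, 5, Inf), labels = LETTERS[1:3],
                   right = FALSE)

# Side-by-side view: months 2 and 5 change category between the two variants
data.frame(months, right_closed, left_closed)
```

So with the default setting the break value itself (month 2, month 5) belongs to the lower category, which is what the desired output in the question needs.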
Based on the updated example:
cut(as.numeric(sub("-", ".", df$yrQ)),
breaks = c(-Inf, 2014.2, 2016.5, Inf), labels = LETTERS[1:3])
#[1] A A B B B C C
#Levels: A B C
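Note that turning "2016-10" into the number 2016.1 would place October before February (2016.2), so the decimal trick only holds while every month has a single digit, as in the example data. A minimal sketch of one way around this, assuming a running month index (year * 12 + month) is acceptable; `month_index`, `ym` and `yr_cat` are illustrative names, not part of the original answer, and the second break is placed after 2016-4 as described in the question:

```r
# Example labels, including a two-digit month that breaks the decimal trick
ym <- c("2014-1", "2014-2", "2014-3", "2015-4", "2016-4", "2016-6",
        "2016-10", "2017-7")

# Hypothetical helper: convert "YYYY-M" or "YYYY-MM" to a running month count
month_index <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  year  <- as.numeric(vapply(parts, `[`, character(1), 1))
  month <- as.numeric(vapply(parts, `[`, character(1), 2))
  year * 12 + month
}

# Breaks after Feb 2014 and after Apr 2016 (intervals are right-closed)
yr_cat <- cut(month_index(ym),
              breaks = c(-Inf, month_index("2014-2"), month_index("2016-4"), Inf),
              labels = LETTERS[1:3])
yr_cat
```

Because the index is an integer count of months, October 2016 sorts after June 2016 as expected.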
The input data provided in the question seems inconsistent, referring to the same column as yrQ and yr_month at different points in the data structure. We have assumed this input instead, which is the same except that we replaced yrQ in .Names (which is suggestive of year/qtr rather than year/month) with yr_month for consistency with the name shown in list().
df <- structure(list(yr_month = structure(1:7, .Label = c("2014-1", "2014-2",
"2014-3", "2015-4", "2016-4", "2016-6", "2017-7"), class = "factor"),
a = c(4.14, 2.83, 3.71, 4.15, 4.63, 4.91, 5.31), b = c(4.25,
3.5, 3.5, 3.5, 3.5, 3.5, 5)), .Names = c("yr_month", "a", "b"
), row.names = c(NA, 7L), class = "data.frame")
The example data in the question only has one-digit months, but we assume it needs to work even if there is a mix of 1-digit (Jan, Feb, ..., Sep) and 2-digit (Oct, Nov, Dec) months.
1) Convert to "yearmon" class (which may also help if we need to do other things with this column), perform a comparison to each cut point and add the results, giving a number 0, 1 or 2 representing before, between and after respectively. Then add 1 and use that as a subscript into a vector of the category names (here LETTERS). This could be extended to more categories by just adding more comparison terms.
library(zoo)
df$yr_month <- as.yearmon(df$yr_month) ##
transform(df, yr.cat = LETTERS[ (yr_month >= "2014-03") + (yr_month > "2016-04") + 1])
giving:
yr_month a b yr.cat
1 Jan 2014 4.14 4.25 A
2 Feb 2014 2.83 3.50 A
3 Mar 2014 3.71 3.50 B
4 Apr 2015 4.15 3.50 B
5 Apr 2016 4.63 3.50 B
6 Jun 2016 4.91 3.50 C
7 Jul 2017 5.31 5.00 C
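The same summed-comparison idea carries over to more cut points. A rough sketch with four categories, using base R "Date" values (the first of each month) in place of "yearmon" so that it runs without zoo; the third cut point (Dec 2016) and the names `ym`, `d` and `yr_cat` are made up for illustration:

```r
ym <- c("2014-1", "2014-2", "2014-3", "2015-4", "2016-4", "2016-6", "2017-7")

# First day of each month as a Date; "2014-1-1" parses with the default format
d <- as.Date(paste0(ym, "-1"))

# Each comparison contributes 0 or 1; the sum counts how many cut points
# a date has passed, and adding 1 turns it into an index into LETTERS
yr_cat <- LETTERS[(d >= as.Date("2014-03-01")) +
                  (d >  as.Date("2016-04-01")) +
                  (d >  as.Date("2016-12-01")) + 1]
yr_cat
```

Choosing >= or > per term decides which side of each cut point the boundary month falls on.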
2) To do it without any packages, change the line marked ## in (1) to the line of code below. Here we convert yr_month to "Date" class and then remove the day part of its character representation. This leaves 2 digits for the month so that comparisons between 1-digit and 2-digit months work properly. (In (1) the "yearmon" class handles that automatically.)
df$yr_month <- sub("...$", "", as.Date(paste0(df$yr_month, -1)))
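To see why the zero-padding matters, the following sketch compares raw and padded labels as plain strings; `raw` and `padded` are illustrative names, and the padding line is the same expression as above applied to a character vector:

```r
raw <- c("2016-4", "2016-10")

# Unpadded strings compare character by character, so "2016-10" sorts
# before "2016-4" because "1" comes before "4"
raw[2] > raw[1]

# Same conversion as above: parse as Date, then drop the trailing "-dd"
padded <- sub("...$", "", as.Date(paste0(raw, -1)))
padded

# With two-digit months the string order matches the calendar order
padded[2] > padded[1]
```

This is also why the cut points in the transform() call are written as "2014-03" and "2016-04" rather than "2014-3" and "2016-4".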
Revised: Have made a number of revisions.