Assign day of the year to a month
Sample data
df <- data.frame(ID1 = rep(1:1000, each= 5*365), year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5))
This data contains a column day, which is the day of the year. I need to produce two columns:
Month column: the month each day belongs to.
Biweek column: the biweek each day belongs to. There are 24 biweeks in a year. Within a month, all days <= 15 form the first biweek and all days > 15 form the second. For example, days 1 to 15 are biweek 1, days 16 to 31 are biweek 2, and days 32 to 46 are biweek 3.
For the sake of simplicity, I am assuming all the years are non-leap years.
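The rule above can be sketched as a vectorized base R lookup. This is only an illustration under the 365-day assumption; the names month_len, month_start, month_of_day and biweek_of_day are made up for this sketch and do not come from the question.

```r
# Days per month in a non-leap year (assumption: every year has 365 days)
month_len   <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
# Day of year on which each month starts: 1, 32, 60, ..., 335
month_start <- cumsum(c(1, head(month_len, -1)))

# Month (1-12) for each day of year, fully vectorized
month_of_day <- function(day) findInterval(day, month_start)

# Biweek (1-24): odd for day-of-month <= 15, even otherwise
biweek_of_day <- function(day) {
  m    <- month_of_day(day)
  mday <- day - month_start[m] + 1   # day within the month
  2 * m - (mday <= 15)
}

month_of_day(c(1, 59, 60, 335, 365))
biweek_of_day(c(15, 16, 59, 60, 365))
```

Because findInterval works on the whole vector at once, no per-row sapply call is needed.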
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>% mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
date = as.Date(paste0(year,'-',format(strptime(paste0('1981-',day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script; I am looking for a faster solution.
EDIT: To be clear, I have assumed that every year has 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (the last day of Feb is day 60, but I want it to be day 59) and therefore Dec has only 30 days (Dec starts at day 336, though it should start at day 335). I hope this is clear. My solution addresses this issue but takes a lot of time to run.
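To make the discrepancy concrete, here is a small sketch of how day-of-year parsing differs between a leap year and a non-leap year. The helper name doy is made up for illustration; it only wraps as.Date with the %j (day of year) format.

```r
# Hypothetical helper: parse a day of year in a given year, return "month-day"
doy <- function(year, day) {
  format(as.Date(paste0(year, "-", day), format = "%Y-%j"), "%m-%d")
}

leap     <- doy(2000, c(59, 60, 336))  # real calendar of leap year 2000
non_leap <- doy(2001, c(59, 60, 335))  # real calendar of non-leap year 2001

leap
non_leap
```

In the leap year, day 60 is still in February and December only begins on day 336, which is exactly the shift the 365-day assumption is meant to avoid.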
Here is a solution using lubridate extractors and replacement functions, as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. A running time of 8 sec seems pretty acceptable to me; I'm sure some optimising could shave that down, though possibly at a loss of generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 in a leap year.
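The code for this first version is not reproduced here, so the following is only a rough sketch of the idea, not the answer's actual code. It assumes lubridate is installed, uses ifelse instead of case_when and plain date arithmetic instead of yday<-, and the helper name doy_to_date is hypothetical.

```r
library(lubridate)

# Hypothetical helper: map a 365-day "day" index onto a real calendar date
# in `year`, skipping Feb 29 when the year is a leap year.
doy_to_date <- function(year, day) {
  # In leap years, shift day 60 and later by one so Feb 29 is never used
  adj <- ifelse(leap_year(year) & day >= 60, day + 1, day)
  make_date(year, 1L, 1L) + (adj - 1)
}

d <- doy_to_date(c(2000, 2000, 2001), c(59, 60, 60))
mon    <- month(d)                      # month of each date
biweek <- mon * 2 - (mday(d) <= 15)     # first or second half of the month
data.frame(date = d, month = mon, biweek = biweek)
```

Building one date per row is what makes this approach comparatively slow on 1.8 million rows.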
EDIT: Here is a significantly faster solution. You can compute the mapping of DOYs (days of year) to months and biweeks for a single year, and then left_join it to the main table. Running time is 0.36 s, since you no longer have to create the date repeatedly. We also bypass having to use case_when for the leap-year adjustment, since the join will take care of the missing days. Note that day 59 of year 2000 is February and day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
ID1 = rep(1:1000, each= 5*365),
year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5)
)
tictoc::tic("")
doys <- tibble(
day = rep(1:365),
date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
month = month(date),
biweek = case_when(
mday(date) <= 15 ~ (month * 2) - 1,
mday(date) > 15 ~ month * 2
)
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
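For readers without the tidyverse, the same lookup idea can be sketched in base R. This is an assumption-laden illustration rather than part of the answer: ref and small are made-up names, and small is a tiny stand-in for the main table.

```r
# Reference calendar for a non-leap year (2001), one row per day of year
ref_date <- seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
ref <- data.frame(
  day   = 1:365,
  month = as.integer(format(ref_date, "%m")),
  mday  = as.integer(format(ref_date, "%d"))
)
ref$biweek <- ref$month * 2L - (ref$mday <= 15)

# Small stand-in for the main table (assumed columns: year, day)
small <- data.frame(year = 2000L, day = c(15L, 16L, 59L, 60L, 365L))

# Because `day` runs 1..365, it can index the reference rows directly
small$month  <- ref$month[small$day]
small$biweek <- ref$biweek[small$day]
small
```

Direct indexing works here only because day is a dense 1..365 integer; for other keys a join or match() would be the safer choice.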
You can speed this up by almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
month = as.numeric(format(date, "%m")), # extract month
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus the original version in the question:
# user system elapsed
# 117.67 0.15 118.45
Filtered for one year. I think it solves the leap-year issue you described, unless I'm misunderstanding what you're saying. The last day of Feb is 59 in the df in my result below, but only because day is 0-indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
day = day - 1, # dates are 0 indexed
date = as.Date(day, origin = "2000-01-01"),
month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so the times reflect that.