简体   繁体   English

将日期中的日期指定为一个月

[英]Assign day of the day year to a month

Sample data 样本数据

    df <- data.frame(ID1 = rep(1:1000, each= 5*365), year = rep(rep(2000:2004, each = 365), times = 1000), 
             day = rep(1:365, times = 1000*5), 
             x= runif(365*1000*5))

This data contains a column day which is the day of the year. 此数据包含列day ,即一年中的某一天。 I need to produce two columns: 我需要生成两列:

  • Month column: a column of month (which month does the day belong) 月份列:月份列(当天所属的月份)

  • Biweek column: which biweek does a day belong to. Biweek专栏:biweek每天都属于哪一个。 There are 24 biweek in a year. 一年有24个双周。 All days <= 15 in a month is the first biweek and > 15 is second biweek. 一个月<= 15的所有日子是第一个双周,并且> 15是第二个双周。 For eg 例如

    • 15th Jan is Biweek 1, 1月15日是Biweek 1,
    • 16-31 Jan is biweek 2, 1月16日至31日是双周2,
    • 1-15 Feb is biweek 3 and 2月1日至15日是双周3和
    • 16-28 Feb is biweek 4 and so on. 2月16日至28日是双周4,依此类推。

For sake of simplicity, I am assuming all the years are non-leap years. 为简单起见,我假设所有年份都是非闰年。

Here's the code I have (with help from RS as well) that creates the two columns. 这是我的代码(在RS的帮助下),它创建了两列。

  # create a vector of days for each month

  months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)

  library(dplyr)


  ptm <- proc.time()
  df <- df %>% mutate(month =  sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
                           date = as.Date(paste0(year,'-',format(strptime(paste0('1981-',day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
                           twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
                 dplyr::select(-date) 
  proc.time() - ptm

  user  system elapsed 
  121.71    0.31  122.43 

My issue is that of the time it takes to run this script and I am looking for a solution that is relatively faster 我的问题是运行此脚本所需的时间,我正在寻找一个相对更快的解决方案

EDIT: To be clear, I have assumed all years must have 365 days. 编辑:要明确,我假设所有年份必须有365天。 In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (last day of Feb is 60 but I want the last day to be 59) and therefore Dec has only 30 days (Dec start with 336 though it should start with 335). 在下面的一个答案中,对于2000年(闰年),2月有29天(2月的最后一天是60天,但我希望最后一天是59天),因此12月只有30天(12月开始时为336天)虽然它应该以335)开头。 I hope this is clear. 我希望这很清楚。 My solution addresses this issue but takes lot of time to run. 我的解决方案解决了这个问题,但需要花费大量时间来运行

Here is a solution using lubridate extractors and replacement functions as mentioned by Frank in a comment . 以下是Frank在评论中提到的使用lubridate提取器和替换功能的解决方案。 The key ones are yday<- , mday() and month() , which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. 关键是mday() yday<-mday()month() ,它们分别设置日期的年份,获取日期的月份,并获取日期的月份。 8 sec running time seems pretty acceptable to me, though I'm sure some optimising could shave that down though there might be a loss of generality. 8秒的运行时间对我来说似乎是可以接受的,尽管我确信一些优化可以减少这种情况,尽管可能会失去一般性。

Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year. 另请注意使用case_when确保闰年2月29日之后的正确天数。

EDIT: Here is a significantly faster solution. 编辑:这是一个明显更快的解决方案。 You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 您可以将DOY的映射到一年的月份和双周,然后将left_join到主表。 0.36s running time, since you no longer have to repetitively create the date. 运行时间为0.36秒,因为您不再需要重复创建日期。 We also bypass having to use case_when , since the join will take care of the missing days. 我们还绕过必须使用case_when ,因为case_when将处理丢失的日子。 See that Day 59 of year 2000 is February and Day 60 is March, as requested. 根据要求,见2000年第59天是2月,第60天是3月。

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
tbl <- tibble(
  ID1 = rep(1:1000, each= 5*365),
  year = rep(rep(2000:2004, each = 365), times = 1000),
  day = rep(1:365, times = 1000*5),
  x= runif(365*1000*5)
)

tictoc::tic("")
doys <- tibble(
  day = rep(1:365),
  date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
  month = month(date),
  biweek = case_when(
    mday(date) <= 15 ~ (month * 2) - 1,
    mday(date) > 15  ~ month * 2
  )
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#>      ID1  year   day     x month biweek
#>    <int> <int> <int> <dbl> <dbl>  <dbl>
#>  1     1  2000     1 0.331    1.     1.
#>  2     1  2000     2 0.284    1.     1.
#>  3     1  2000     3 0.627    1.     1.
#>  4     1  2000     4 0.762    1.     1.
#>  5     1  2000     5 0.460    1.     1.
#>  6     1  2000     6 0.500    1.     1.
#>  7     1  2000     7 0.340    1.     1.
#>  8     1  2000     8 0.952    1.     1.
#>  9     1  2000     9 0.663    1.     1.
#> 10     1  2000    10 0.385    1.     1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#>      ID1  year   day     x month biweek
#>    <int> <int> <int> <dbl> <dbl>  <dbl>
#>  1     1  2000    55 0.127    2.     4.
#>  2     1  2000    56 0.779    2.     4.
#>  3     1  2000    57 0.625    2.     4.
#>  4     1  2000    58 0.245    2.     4.
#>  5     1  2000    59 0.640    2.     4.
#>  6     1  2000    60 0.423    3.     5.
#>  7     1  2000    61 0.439    3.     5.
#>  8     1  2000    62 0.105    3.     5.
#>  9     1  2000    63 0.218    3.     5.
#> 10     1  2000    64 0.668    3.     5.
#> 11     1  2000    65 0.589    3.     5.

Created on 2018-04-06 by the reprex package (v0.2.0). reprex包创建于2018-04-06 (v0.2.0)。

You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date. 通过先定义日期,减少日期调用中的冗余,然后从日期中提取月份,您可以将此速度提高几乎一个数量级。

    ptm <- proc.time()
    df <- df %>% mutate(
      date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates 
      month = as.numeric(format(date, "%m")), # extract month
      twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
      dplyr::select(-date) 
    proc.time() - ptm

#   user  system elapsed 
#  18.58    0.13   18.75 

Versus original version in the question 与问题中的原始版本相对应

#   user  system elapsed 
# 117.67    0.15  118.45 

Filtered for one year. 过滤一年。 I think it solves the leap issue you described, unless I'm not clear on what you're saying. 我认为它解决了你所描述的跳跃问题,除非我不清楚你在说什么。 Last day of Feb is 59 in the df in my result below, but only because day is 0 indexed. 在我的结果中,2月的最后一天是df中的59,但只是因为day是0索引。

df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
  day = day - 1, # dates are 0 indexed
  date = as.Date(day, origin = "2000-01-01"),
  month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
  bis = month * 2  - (as.numeric(format(date, "%d")) <= 15)
  )
proc.time() - ptm

user  system elapsed 
0.8     0.0     0.8

One year is 0.2 of the whole df, so times reflect that. 一年是整个df的0.2,所以时间反映了这一点。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM