
How to specify day and month for messy date data with missing day and month when converting to date in a large data frame

I have a large data frame of over 100k rows. The date column contains dates in multiple formats such as "%m/%d/%Y", "%Y-%m", "%Y", and "%Y-%m-%d". I can convert all of these to dates with parse_date_time() from lubridate.

library(lubridate)

dates <- c("05/10/1983","8/17/2014","1953-12","1975","2001-06-17")

parse_date_time(dates, orders = c("%m/%d/%Y","%Y-%m","%Y","%Y-%m-%d"))

[1] "1983-05-10 UTC" "2014-08-17 UTC" "1953-12-01 UTC" "1975-01-01 UTC" "2001-06-17 UTC"

But as you can see, this sets dates with a missing day to the first of the month, and dates with a missing month and day to the first of the year. How can I set those to the 15th of the month and to June 15th, respectively?

Use nchar to check the length of each string in the dates vector and paste on whatever is missing: a year-only string has 4 characters and a year-month string has 7.

library(lubridate)

dates <- c("05/10/1983","8/17/2014","1953-12","1975","2001-06-17")


dates <- ifelse(nchar(dates) == 4, paste(dates, "06-15", sep = "-"),
             ifelse(nchar(dates) == 7, paste(dates, 15, sep = "-"), dates))
dates
#[1] "05/10/1983" "8/17/2014"  "1953-12-15" "1975-06-15"
#[5] "2001-06-17"

parse_date_time(dates, orders = c("%m/%d/%Y","%Y-%m","%Y","%Y-%m-%d"))
#[1] "1983-05-10 UTC" "2014-08-17 UTC" "1953-12-15 UTC"
#[4] "1975-06-15 UTC" "2001-06-17 UTC"
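Since the question is about a column in a large data frame, the same idea can be wrapped in a small helper and applied to the whole column at once. The following is only a sketch under stated assumptions: the helper name pad_partial_dates, the data frame df and its columns raw and date are hypothetical, and it detects partial dates with regular expressions instead of nchar so that other 4- or 7-character strings are left untouched.

```r
library(lubridate)

# Hypothetical helper: complete "YYYY" to "YYYY-06-15" and "YYYY-MM" to "YYYY-MM-15"
pad_partial_dates <- function(x) {
  year_only  <- grepl("^[0-9]{4}$", x)            # e.g. "1975"
  year_month <- grepl("^[0-9]{4}-[0-9]{1,2}$", x) # e.g. "1953-12"
  x[year_only]  <- paste0(x[year_only], "-06-15")
  x[year_month] <- paste0(x[year_month], "-15")
  x
}

# Hypothetical data frame holding the raw strings from the question
df <- data.frame(
  raw = c("05/10/1983", "8/17/2014", "1953-12", "1975", "2001-06-17"),
  stringsAsFactors = FALSE
)

# After padding, only two formats remain: month/day/year and year-month-day
df$date <- as.Date(parse_date_time(pad_partial_dates(df$raw),
                                   orders = c("mdY", "Ymd")))
df$date
```

Keeping the raw column next to the parsed one makes it possible to see later which values had their day or month filled in.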

Another solution would be to use an index vector, also based on nchar (starting again from the original dates vector).

n <- nchar(dates)
dates[n == 4] <- paste(dates[n == 4], "06-15", sep = "-")
dates[n == 7] <- paste(dates[n == 7], "15", sep = "-")

dates
#[1] "05/10/1983" "8/17/2014"  "1953-12-15" "1975-06-15"
#[5] "2001-06-17"

As you can see, the result is the same as with ifelse.
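Both nchar-based versions assume that only year-only strings have 4 characters and only year-month strings have 7. On 100k rows of messy data it may be worth checking that assumption before padding. A minimal sketch in base R, where the variable names len_counts and suspicious are illustrative only:

```r
dates <- c("05/10/1983", "8/17/2014", "1953-12", "1975", "2001-06-17")

# How many values fall into each string length?
# Here lengths 4, 7 and 9 occur once and length 10 occurs twice.
len_counts <- table(nchar(dates))
len_counts

# Flag strings of length 4 or 7 that do not look like "YYYY" or "YYYY-MM"
n <- nchar(dates)
suspicious <- (n == 4 & !grepl("^[0-9]{4}$", dates)) |
              (n == 7 & !grepl("^[0-9]{4}-[0-9]{2}$", dates))
dates[suspicious]
```

If dates[suspicious] is not empty, those values would be padded incorrectly and need separate handling.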

Here's another way of doing that, based on orders:

library(lubridate)
dates <- c("05/10/1983","8/17/2014","1953-12","1975","2001-06-17")

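parseDates <- function(x, orders = c('mdY', 'dmY', 'Ymd', 'Y', 'Ym')){
  # Guess the format of x and parse it with the first match
  fmts <- guess_formats(x, orders = orders)
  dte <- parse_date_time(x, orders = fmts[1], tz = 'UTC')
  # No month in the format: year only, parsed as January 1st
  if(!grepl('m', fmts[1]) ){
    # Move to June 15th; months(5) + days(14) also holds in leap years,
    # where adding days(165) would give June 14th
    dte <- dte + months(5) + days(14)
    return(dte)
  }
  # No day in the format: year and month, parsed as the 1st of the month
  if(!grepl('d', fmts[1]) ){
    dte <- dte + days(14)
  }
  return(dte)
}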

Output:

> parseDates(dates[4])
[1] "1975-06-15 UTC"
> parseDates(dates[3])
[1] "1953-12-15 UTC"

This way, for different date formats you only need to change the orders argument, while the rest is done by lubridate. Note that, as written, the function uses only the first guessed format, so it is meant to be called on one string at a time.

Hope this is helpful!
