在 R 中的 lubridate 中处理大型数据集中日期的计算有效方法

Question

I have data which looks like this, but is 20 million rows.我有看起来像这样的数据，但有 2000 万行。

library(tidyr)
library(dplyr)
library(stringr)
library(magrittr)
library(lubridate)
library(tidyverse)


df <- data.frame(
  DATE_OF_BIRTH = c("1933-03-31", "1947-06-25", "1901-09-02", "1952-01-22", "1936-07-18", "2020-10-22", "1930-05-18", "1926-05-13"),
  DATE_OF_DEATH = c(NA, "2019-02-04", "2017-10-27", NA, "2021-01-03", NA, NA, NA),
)

What I want to do is我想做的是

A) figure out how old people are, as at 31 December 2019; A) 算出截至 2019 年 12 月 31 日的年龄； and divide them into age groupings并将他们分成不同的年龄组

B) remove people with impossible ages or dates of death B) 删除年龄或死亡日期不可能的人

This is the code I am running to do this这是我正在运行的代码

#Change the missing dates of death into a format recognisable as a date, which is far into the future
df %<>%
  replace_na(list(DATE_OF_DEATH = "01/01/9999"))

#Specify the start and end date of the year of interest
end_yr_date = dmy('31/12/2019')
start_yr_date = dmy('01/01/2019')

df %<>%
  #create age
  mutate(age = floor(interval(start = dmy(DATE_OF_BIRTH), end = end_yr_date) / 
                       duration(num = 1, units = "years"))) %>%
  #and age groupings
  mutate(age_group = cut(age, 
                         breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 150), 
                         labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54","55-59", "60-64", "65-69",
                                    "70-74", "75-79", "80-84", "85+"), right = FALSE))


df %<>%
  #remove people who were born after end date
  filter(!(dmy(DATE_OF_BIRTH) > end_yr_date)) %>%
  #remove people who died before start date
  filter(!(dmy(DATE_OF_DEATH) < start_yr_date)) %>%
  #Remove people with a negative age
  filter(age >= 0) %>%
  #Remove people older than 115
  filter(age < 116)

This runs fine on this sample dataset, but it just keeps running and running and running on the 20 million rows of data.这在这个示例数据集上运行良好，但它只是在 2000 万行数据上继续运行和运行。 I am wondering if there are ways to work with dates which are more computationally efficient, and maybe faster?我想知道是否有办法处理计算效率更高且可能更快的日期？

I'm also wondering if I might have date formats which won't parse (I have removed NA dates, but maybe there are other data entry errors which aren't in the right format) which is why the code just keeps running.我还想知道我是否可能有无法解析的日期格式（我已经删除了 NA 日期，但也许还有其他格式不正确的数据输入错误），这就是代码一直运行的原因。 Does anyone know of an efficient way to determine any date formats which wouldn't parse (which aren't NA)?有谁知道确定任何无法解析的日期格式（不是 NA）的有效方法吗？

Thanks for your help.谢谢你的帮助。

Answer 1

You can change the columns to date class once and include all the filter expression into one.您可以将列更改为日期类一次，并将所有filter表达式合并为一个。

library(dplyr)
library(lubridate)

df %>%
  mutate(across(c(DATE_OF_BIRTH, DATE_OF_DEATH), ymd),
         age = floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) / 
                       duration(num = 1, units = "years")), 
         age_group = cut(age, 
                         breaks = c(seq(0, 85, 5), 150), 
                         labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54","55-59", "60-64", "65-69",
                                    "70-74", "75-79", "80-84", "85+"), right = FALSE)) %>%
  filter(DATE_OF_BIRTH < end_yr_date, DATE_OF_DEATH > start_yr_date, 
         between(age, 0, 116)) -> result

If this is still slow you can switch to data.table .如果这仍然很慢，您可以切换到data.table 。

library(data.table)

setDT(df)
df[, c('DATE_OF_BIRTH', 'DATE_OF_DEATH') := lapply(.SD, ymd), .SDcols = c('DATE_OF_BIRTH', 'DATE_OF_DEATH')] %>%
  .[, age := floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) / 
                duration(num = 1, units = "years"))] %>%
  .[, age_group := cut(age, 
                breaks = c(seq(0, 85, 5), 150), 
                labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                           "45-49", "50-54","55-59", "60-64", "65-69",
                           "70-74", "75-79", "80-84", "85+"), right = FALSE)] %>%
  .[DATE_OF_BIRTH < end_yr_date & DATE_OF_DEATH > start_yr_date & between(age, 0, 116)] -> result

在 R 中的 lubridate 中处理大型数据集中日期的计算有效方法

问题描述

1 个解决方案

解决方案1
3 已采纳 2020-11-06 02:50:37

在 R 中的 lubridate 中处理大型数据集中日期的计算有效方法

问题描述

1 个解决方案

解决方案1 3 已采纳 2020-11-06 02:50:37

解决方案1
3 已采纳 2020-11-06 02:50:37