R: How can I compute the number of occurrences and the average time delta for unique column-column matches in a dataframe?

I have a dataset that represents patient visits to various doctors in a certain practice throughout a year.

Example-

doctor    patient_no  datetime

dr.kahn   1561        1/21/19 10:30:00
dr.gould  1397        2/06/19 12:30:00
dr.amoor  1596        2/11/19 9:00:00
dr.gould  995         10/07/19 12:30:00
dr.kahn   1561        10/14/19 9:30

I'm trying to create a new dataframe where each row is a unique doctor-patient pairing and shows the number of times that patient visited that doctor, along with the average time elapsed between visits for that particular pairing. For instance, if patient A went to dr.kahn 4 times in a year, what was the average amount of time between patient A's consecutive appointments with dr.kahn?

Example-

doctor   patient_no   number_of_visits  avg_time_passed_between_appointments

dr.gould   1054       7                 2 months 1 days  2:00:00
dr.gould   1099       2                 5 months 10 days 00:00:00
dr.kahn    875        12                0 months 26 days 0:30:00

Any help would be appreciated. Thanks!

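Here's a dplyr approach. The average gap is the time from the first to the last visit divided by the number of gaps (count - 1):

library(tidyverse)
df %>%
  # parse "m/d/y h:m" strings into date-times
  mutate(datetime = lubridate::mdy_hm(datetime)) %>%
  # one group per doctor-patient pair
  group_by(doctor, patient_no) %>%
  summarize(count = n(),
            # pairs with a single visit have no gaps, so this is 0 / 0 = NaN
            avg_days_between = (max(datetime) - min(datetime)) / lubridate::ddays(count - 1)) %>%
  ungroup()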

## A tibble: 4 x 4
#  doctor   patient_no count avg_days_between
#  <chr>         <dbl> <int>            <dbl>
#1 dr.amoor       1596     1             NaN 
#2 dr.gould        995     1             NaN 
#3 dr.gould       1397     1             NaN 
#4 dr.kahn        1561     2             266.
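
The (last - first) / (count - 1) shortcut works because the consecutive gaps sum to the total span, so it equals the mean of the individual gaps. Here is a small base R check of that identity. The three visit times are made up for illustration and are not from the question's data:

```r
# Visit times for one hypothetical doctor-patient pair (illustrative values only)
visits <- as.POSIXct(c("2019-01-21 10:30", "2019-03-02 09:00", "2019-10-14 09:30"),
                     tz = "UTC")

# Gaps between consecutive visits, in days
gaps <- diff(as.numeric(visits)) / 86400

# Mean of the individual gaps
mean_of_gaps <- mean(gaps)

# Shortcut: total span divided by the number of gaps
span_days <- (as.numeric(max(visits)) - as.numeric(min(visits))) / 86400
shortcut <- span_days / (length(visits) - 1)

# TRUE if the two agree up to floating-point tolerance
all.equal(mean_of_gaps, shortcut)
```

The identity holds only for the mean. For a median or any other statistic you need the individual gaps.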

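Or you could calculate the gap before each visit and summarize those gaps with a different statistic, such as the median. This assumes the rows are already in chronological order within each doctor-patient pair.

df %>%
  group_by(doctor, patient_no) %>%
  mutate(datetime = lubridate::mdy_hm(datetime),
         # lag() is NA for each pair's first visit; coalesce() turns that NA into 0,
         # so that 0 is counted in the median below
         days_since_last = coalesce(c(datetime - lag(datetime))/
                                      lubridate::ddays(1), 0)) %>%
  summarize(count = n(),
            median_time_between = median(days_since_last))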

## A tibble: 4 x 4
# Groups:   doctor [3]
#  doctor   patient_no count median_time_between
#  <chr>         <dbl> <int>               <dbl>
#1 dr.amoor       1596     1                  0 
#2 dr.gould        995     1                  0 
#3 dr.gould       1397     1                  0 
#4 dr.kahn        1561     2                133.
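
Note that the 0 filled in for each pair's first visit is counted in the median, which is why dr.kahn / 1561 shows about 133 days even though the only gap is about 266 days. If you would rather leave the first visit out, here is a rough base R sketch that computes gaps only between consecutive visits and reports NA for pairs with a single visit. The helper name gap_summary and the data frame name visits are my own, and times are parsed as UTC as an assumption:

```r
# Same five rows as the sample data, as a plain data.frame
visits <- data.frame(
  doctor     = c("dr.kahn", "dr.gould", "dr.amoor", "dr.gould", "dr.kahn"),
  patient_no = c(1561, 1397, 1596, 995, 1561),
  datetime   = c("1/21/19 10:30", "2/6/19 12:30", "2/11/19 9:00",
                 "10/7/19 12:30", "10/14/19 9:30"),
  stringsAsFactors = FALSE
)
visits$datetime <- as.POSIXct(visits$datetime, format = "%m/%d/%y %H:%M", tz = "UTC")

# Hypothetical helper: summarise the visit times of one doctor-patient pair
gap_summary <- function(times) {
  # gaps in days between consecutive visits; empty when there is only one visit
  gaps <- diff(as.numeric(sort(times))) / 86400
  c(count       = length(times),
    mean_days   = if (length(gaps) > 0) mean(gaps) else NA_real_,
    median_days = if (length(gaps) > 0) median(gaps) else NA_real_)
}

# One element per doctor-patient pair that actually occurs in the data
pairs <- split(visits$datetime, list(visits$doctor, visits$patient_no), drop = TRUE)

# One row per pair, named like "dr.kahn.1561"
result <- do.call(rbind, lapply(pairs, gap_summary))
result
```

Sorting inside the helper means the input rows do not have to be in chronological order.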

Sample data:

df <- tibble::tribble(
     ~doctor, ~patient_no,       ~datetime,
   "dr.kahn",        1561, "1/21/19 10:30",
  "dr.gould",        1397,  "2/6/19 12:30",
  "dr.amoor",        1596,  "2/11/19 9:00",
  "dr.gould",         995, "10/7/19 12:30",
   "dr.kahn",        1561, "10/14/19 9:30"
  )
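
The question's desired output shows the gap as days plus a clock time rather than as a decimal number of days. Here is a minimal base R sketch for that formatting. The helper name format_gap is hypothetical, and it deliberately stops at days because a "month" has no fixed length:

```r
# Hypothetical helper: format a gap given in days as "<d> days HH:MM:SS"
format_gap <- function(days) {
  total_secs <- round(days * 86400)   # whole seconds in the gap
  d   <- total_secs %/% 86400         # whole days
  rem <- total_secs %% 86400          # leftover seconds within the last day
  sprintf("%d days %02d:%02d:%02d",
          d, rem %/% 3600, (rem %% 3600) %/% 60, rem %% 60)
}

# The dr.kahn / 1561 gap: 266 days minus one hour
format_gap(266 - 1/24)
# "265 days 23:00:00"
```

This could be applied to the avg_days_between column with mutate() once the summary is computed.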
