简体   繁体   中英

Group data frame columns by their dates (which comprise column titles) and summarize instances of 1s and 0s in R

I have a bit of annoying coding problem, about which I'd appreciate your assistance.

Here are some sample data:

ID     "2013-03-20"    "2013-04-09"    "2013-04-11"    "2013-04-17"    
5167f            0               0               0               1  
1214m            0               0               0               0  
1844f            0               1               1               0  
2113m            0               0               1               1  

Here's the rub: notice how the column names of the sample data frame comprise dates, and some dates can be close to one another. The ID column is a simple unique identifier for the observations.

Here are my goals:

(1) I'd like to be able to first group the columns by whether or not they each fall within a 2-week (or 14-day) range (ie that is defined by 2-week increments starting on "2013-03-20" until "2016-12-20" ); thus, if they do fall within the same range, they'd receive the same identifier (for a new data frame; eg, Period1 for any dates that fall between "2013-03-20" and "2013-04-03" ).

(2) Once all of the date columns are assigned to a particular period, I then want to summarize the cell data (of 0s and 1s) per period in the following way: if any 1 appears in a particular period for an individual (at all) then that individual would receive a 1 for the whole period (also, if the individual only has 0s for that period, then it would receive a 0 for that single period).

(3) Once this workflow is defined, I then want to group the dates by seasons and years based on their month codes (eg, WinterYYYY = December, January, February; SpringYYYY = March, April, May; SummerYYYY = June, July, August; and FallYYYY = September, October, November) to produce a new data frame.

In summary, to demonstrate the products manually:

(End Product for Goals 1 and 2; ie for just first two columns in the sample data [date ranges in parentheses are just for a guide])

ID     Period1 ("2013-03-20" - "2013-04-03")        Period2 ("2013-04-04" - "2013-04-18")      
5167f                                    0                                 1    
1214m                                    0                                 0    
1844f                                    0                                 1    
2113m                                    0                                 1    

(End Product for Goals 2 and 3; ie for all columns in the sample data [month ranges in parentheses are just for a guide])

ID                Spring2013  (March - May)    
5167f                                    1  
1214m                                    0  
1844f                                    1  
2113m                                    1

Perhaps, something from the dplyr package could be useful, but I'm not quite sure.

Thank you for any help, in advance. Please feel free to ask any follow-up questions for clarifications.

-AD-

Please put your data in a tidy format first.

library(dplyr)
data <- gather(data, date, value, -ID )

Then try:

library(lubridate)
data$date  <- ymd(data$date)
data <-  mutate(data, period = date - as.Date("2013-03-20")) #difference in days
data <- mutate(data, period2 = ceiling(as.numeric(data$period)/14))
data$period2 <- ifelse(data$period2 == 0, 1, data$period2) #change period 0 to period 1

newdat <- data %>% 
          group_by(ID, period2) %>%
          summarise(result = ifelse(sum(value)>0, 1, 0))

Use the spread() function to change back to the original format.

Here is a solution using functions from tidyverse .

# Load packages
library(tidyverse)
library(data.table)
library(lubridate)

# Create example data frames
dt <- fread("ID     '2013-03-20'    '2013-04-09'    '2013-04-11'    '2013-04-17'    
5167f            0               0               0               1  
                 1214m            0               0               0               0  
                 1844f            0               1               1               0  
                 2113m            0               0               1               1")

The key is to prepare a table showing the association between date and grouping variable, such as period, month, or season. In this example, dt_merge is such table.

dt_merge <- data_frame(
  # Create a column showing the beginning date
  Date1 = seq(from = ymd("2013-03-20"), to = ymd("2016-12-20"), by = "2 weeks")) %>%
  # Create  a column showing the end date of each period
  mutate(Date2 = lead(Date1)) %>%
  # Adjust Date1
  mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
  # Remove the last row
  drop_na(Date2) %>%
  # Create date list
  mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
  unnest() %>%
  # Create Group ID
  mutate(RunID = group_indices_(., dots. = c("Date1", "Date2"))) %>%
  # Create Period ID
  mutate(Period = paste0("Period", RunID)) %>%
  # Add a column showing Month
  mutate(Month = month(Dates)) %>%
  # Add a column showing Year
  mutate(Year = year(Dates)) %>%
  # Add a column showing season
  mutate(Season = case_when(
    Month %in% 3:5            ~ "Spring",
    Month %in% 6:8            ~ "Summer",
    Month %in% 9:11           ~ "Fall",
    Month %in% c(12, 1, 2)    ~ "Winter",
    TRUE                      ~ NA_character_
  )) %>%
  # Combine Season and Year
  mutate(SeasonYear = paste0(Season, Year)) %>%
  select(-Date1, -Date2, -RunID)

After this step, it becomes easy to generate the output you want to. In this example, dt3 is the first end product. dt4 is the second product.

dt2 <- dt %>%
  # Reshape the data frame
  gather(Date, Value, -ID) %>%
  # Convert Date to date class
  mutate(Date = ymd(Date)) %>%
  # Join dt_merge
  left_join(dt_merge, by = c("Date" = "Dates"))

# Product 1
dt3 <- dt2 %>%
  group_by(ID, Period) %>%
  summarise(Value = max(Value)) %>%
  spread(Period, Value)

# Product 2
dt4 <- dt2 
  group_by(ID, SeasonYear) %>%
  summarise(Value = max(Value)) %>%
  spread(SeasonYear, Value) 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM