R: extracting number of IDs that fulfill several conditions

I want to identify the IDs in a dataset that newly developed a disease. The dataset takes the form of a diary in which people answer a daily "yes/no" question on whether they have the disease.

ID <- c(1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
Date <- c("2020-03-10","2020-03-11","2020-03-12","2020-03-13","2020-03-14","2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18", "2020-03-12","2020-03-13","2020-03-14","2020-03-15","2020-03-16","2020-03-17","2020-03-18","2020-03-19","2020-03-20")
Disease <- c("No","No","Yes","Yes","Yes","No","No","No", "Yes","Yes","Yes","No","Yes","Yes","No","No","No","Yes","Yes","Yes","Yes")

df <- data.frame(ID, Date, Disease)

df
ID   Date         Disease
1    2020-03-10   No
1    2020-03-11   No
1    2020-03-12   Yes
1    2020-03-13   Yes
1    2020-03-14   Yes
2    2020-03-12   No
2    2020-03-13   No
2    2020-03-14   No
2    2020-03-15   Yes
2    2020-03-16   Yes
2    2020-03-17   Yes
2    2020-03-18   No
3    2020-03-12   Yes
3    2020-03-13   Yes
3    2020-03-14   No
3    2020-03-15   No
3    2020-03-16   No
3    2020-03-17   Yes
3    2020-03-18   Yes
3    2020-03-19   Yes
3    2020-03-20   Yes

However, to be characterized as having "newly developed the disease", a person has to meet the following conditions:

1. The person has to have answered "yes" for at least two days in a row.
2. The person must have answered "no" for at least three days in a row before the first "yes".

As output, I would like to have the number of people fulfilling these conditions. For the extract of the dataset above, this would be two (IDs 2 and 3).

Does anybody know how to achieve this? Thanks in advance for your time!

A slightly messy way of doing this would be to use the dplyr::lag() function.

library(tidyverse)
library(lubridate)

df %>%
  mutate(Date = ymd(Date)) %>%
  group_by(ID) %>%
  # Disease as answered 1 to 4 diary entries earlier, per person
  mutate(day_1 = lag(Disease, 1, order_by = Date),
         day_2 = lag(Disease, 2, order_by = Date),
         day_3 = lag(Disease, 3, order_by = Date),
         day_4 = lag(Disease, 4, order_by = Date)) %>%
  # Keep the second "Yes" of a streak that follows three "No" answers
  filter(Disease == "Yes" & day_1 == "Yes" &
           day_2 == "No" & day_3 == "No" & day_4 == "No") %>%
  # Drop the grouping so that n() counts IDs across the whole table
  ungroup() %>%
  distinct(ID) %>%
  summarise("Number of patients matching the condition" = n())

This groups the rows by ID, so all calculations are computed individually for each person. It then looks up the value of Disease on each of the four preceding diary days, checks whether each row, together with those four earlier answers, matches the conditions, and finally takes the unique IDs and counts them.
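To make the lagging step more concrete, here is a small sketch that builds the four lag columns for a single person. The names one_id and lagged are made up for this illustration; the answers are those of ID 2 from the question. Note that lag() looks at earlier rows, not earlier calendar days, so this assumes the diary has no missing days.

```r
library(dplyr)

# Diary answers of one person (ID 2 in the question's data)
one_id <- data.frame(
  Date    = as.Date("2020-03-12") + 0:6,
  Disease = c("No", "No", "No", "Yes", "Yes", "Yes", "No")
)

# Add the answers given 1, 2, 3 and 4 diary entries earlier
lagged <- one_id %>%
  mutate(day_1 = lag(Disease, 1, order_by = Date),
         day_2 = lag(Disease, 2, order_by = Date),
         day_3 = lag(Disease, 3, order_by = Date),
         day_4 = lag(Disease, 4, order_by = Date))

lagged
```

The first rows contain NA in the lag columns because there is no earlier entry to look at. A comparison with NA yields NA, and filter() drops such rows, so a person can only match once enough diary entries exist.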

Here is a possibly more compact way to detect patterns in the Disease column. It is based on a similar answer provided here:

https://stackoverflow.com/a/41131260/3460670

Define the pattern you want (in this case, 3 "No" followed by 2 "Yes"), then filter the rows at which this pattern starts. This uses shift from data.table, which accepts a vector for n and returns a list that Map can iterate over, instead of lead from dplyr, which requires n to have length 1.

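library(tidyverse)
library(data.table)

# Sequence of answers to look for, in diary order
pattern <- c("No", "No", "No", "Yes", "Yes")

df %>%
  group_by(ID) %>%
  # Keep the rows at which the whole pattern starts, per person
  filter(Reduce("&", Map("==", shift(Disease, n = 0:(length(pattern) - 1), type = "lead"), pattern))) %>%
  ungroup() %>%
  # Count each ID once, however often the pattern occurs
  summarise(Total = n_distinct(ID))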
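To see why the filter expression works, it may help to run it outside of dplyr on a plain vector. The sketch below uses the answers of ID 2 from the question; the names answers, shifted and hits are only for illustration.

```r
library(data.table)

pattern <- c("No", "No", "No", "Yes", "Yes")
answers <- c("No", "No", "No", "Yes", "Yes", "Yes", "No")  # ID 2

# One copy of the answers per pattern position, moved up by 0 to 4 places
shifted <- shift(answers, n = 0:(length(pattern) - 1), type = "lead")

# Compare each shifted copy with its pattern element, then AND the results
hits <- Reduce("&", Map("==", shifted, pattern))

# Positions at which the full pattern starts
which(hits)
```

An element of hits is TRUE only where all five comparisons hold, that is, where the pattern begins. Near the end of the vector the shifted copies run out of values and contain NA, so the result there can be NA rather than FALSE; filter() treats both as "do not keep".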
