Counting unique occurrences of factor levels and numeric values in long-format data with dplyr

Question

I have repeated-measurement data on 8 patients, each with a varying number of measurements of the same variables. The measured variables are sex, blood pressure (sys_bp), and how many CT scans a person underwent (ct_amount):

library(dplyr)
library(magrittr)

questiondata <- structure(list(id = c(2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 
4, 7, 7, 8, 8, 8, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 20, 
20, 20), time = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L, 1L, 6L, 1L, 2L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 1L, 2L, 4L), .Label = c("T0", "T1M0", "T1M6", 
"T1M12", "T2M0", "FU1"), class = "factor"), sys_bp = c(116, 125.8, 
NA, NA, NA, 113.2, NA, NA, NA, NA, 146, NA, NA, NA, NA, NA, NA, 
125, NA, NA, 164.5, NA, NA, NA, NA, 150.5, NA, NA, NA, NA, 158, 
NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"), 
    ct_amount = c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 3L, 3L, 3L)), row.names = c(NA, -32L), class = c("tbl_df", 
"tbl", "data.frame"))

questiondata

      id time  sys_bp sex    ct_amount
   <dbl> <fct>  <dbl> <fct>      <int>
 1     2 T0      116  female         4
 2     2 T1M0    126. female         4
 3     2 T1M6     NA  female         4
 4     2 T1M12    NA  female         4
 5     3 T0       NA  female         5
 6     3 T1M0    113. female         5
 7     3 T1M6     NA  female         5
 8     3 T1M12    NA  female         5
 9     3 T2M0     NA  female         5
10     4 T0       NA  male           5
11     4 T1M0    146  male           5
12     4 T1M6     NA  male           5
13     4 T1M12    NA  male           5
14     4 T2M0     NA  male           5
15     7 T0       NA  female         2
16     7 FU1      NA  female         2
17     8 T0       NA  female         3
18     8 T1M0    125  female         3
19     8 T2M0     NA  female         3
20    13 T0       NA  female         5
21    13 T1M0    164. female         5
22    13 T1M6     NA  female         5
23    13 T1M12    NA  female         5
24    13 T2M0     NA  female         5
25    14 T0       NA  male           5
26    14 T1M0    150. male           5
27    14 T1M6     NA  male           5
28    14 T1M12    NA  male           5
29    14 T2M0     NA  male           5
30    20 T0       NA  female         3
31    20 T1M0    158  female         3
32    20 T1M12    NA  female         3

I am trying to count the number of persons who (1) are male/female and (2) have had 1/2/3/4/5 CT scans.

So the output would be (1) 6 females and 2 males, and (2) 1 person with 2 CTs, 2 persons with 3 CTs, 1 person with 4 CTs and 4 persons with 5 CTs.

I've tried many combinations of group_by, summarise, and count, but can't seem to get it right. Any help?

Answer 1

You can first keep only one row for each id (distinct with .keep_all = TRUE keeps the first row per id along with all other columns). Then use count to get the output.

library(dplyr)

unique_data <- questiondata %>% distinct(id, .keep_all = TRUE)

unique_data %>% count(sex)
# A tibble: 2 x 2
#  sex        n
#  <fct>  <int>
#1 female     6
#2 male       2

unique_data %>% count(ct_amount)

# A tibble: 4 x 2
#  ct_amount     n
#      <int> <int>
#1         2     1
#2         3     2
#3         4     1
#4         5     4
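
Keeping the first row per id gives the right counts only if sex and ct_amount never change within a patient. A minimal sketch for checking that assumption before counting is shown here. The names n_rows, patients, check and constant_within_id are hypothetical, and patients is a compact stand-in for the id, sex and ct_amount columns of questiondata, rebuilt so the block runs on its own.

```r
library(dplyr)

# Compact stand-in for the id, sex and ct_amount columns of questiondata
# (hypothetical name; rebuilt here only so the sketch is self-contained)
n_rows <- c(4, 5, 5, 2, 3, 5, 5, 3)
patients <- data.frame(
  id        = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = n_rows),
  sex       = factor(rep(c("female", "female", "male", "female",
                           "female", "female", "male", "female"),
                         times = n_rows)),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = n_rows)
)

# For each id, count how many distinct values each column takes
check <- patients %>%
  group_by(id) %>%
  summarise(n_sex = n_distinct(sex), n_ct = n_distinct(ct_amount))

# TRUE when every patient has exactly one sex and one ct_amount
constant_within_id <- all(check$n_sex == 1, check$n_ct == 1)
constant_within_id
```

If this returned FALSE, the row kept by distinct would depend on row order, so the conflicting ids would need to be resolved first.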

Answer 2

We could use duplicated with filter:

library(dplyr)
questiondata %>%
     filter(!duplicated(id)) %>%
     count(ct_amount)
# A tibble: 4 x 2
#  ct_amount     n
#      <int> <int>
#1         2     1
#2         3     2
#3         4     1
#4         5     4
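
The answer above shows only the ct_amount count. As a sketch under the same assumptions, the same filter can be stored once and reused for the sex count as well. The names n_rows, patients, first_rows, sex_counts and ct_counts are hypothetical, and patients is a compact stand-in for the relevant columns of questiondata.

```r
library(dplyr)

# Compact stand-in for the id, sex and ct_amount columns of questiondata
# (hypothetical name; rebuilt here only so the sketch is self-contained)
n_rows <- c(4, 5, 5, 2, 3, 5, 5, 3)
patients <- data.frame(
  id        = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = n_rows),
  sex       = factor(rep(c("female", "female", "male", "female",
                           "female", "female", "male", "female"),
                         times = n_rows)),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = n_rows)
)

# Keep the first row of each id, then reuse it for both counts
first_rows <- patients %>% filter(!duplicated(id))

sex_counts <- first_rows %>% count(sex)        # female 6, male 2
ct_counts  <- first_rows %>% count(ct_amount)  # 2: 1, 3: 2, 4: 1, 5: 4

sex_counts
ct_counts
```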

Answer 1: 3 votes, accepted, 2021-07-15 11:28:18

Answer 2: 1 vote, 2021-07-15 16:51:09
