
Performing a count of each level of a factor grouping by another factor

I would like a data frame in which the counts of 2 of the 4 levels ("Yes" and "No") of a variable are recorded. I can do it by subsetting and filtering on "Yes" or "No", but I feel there must be a better way to do this with dplyr.

null.ta <- dbdata %>%
filter(MutGroup == "Null") %>%
group_by(ICD_Grouping) %>%
summarise(n()) %>%
spread(???????)

The above is roughly what I assume I have to do, but I do not know how to get the spread function to work for this particular variable. I don't mind if all 4 levels are included; I can just drop a couple of columns after the fact.

structure(list(ICD_Grouping = structure(c(50L, 50L, 33L, 33L, 
50L, 50L, 50L, 18L, 21L, 33L, 18L, 18L, 50L, 50L, 50L, 17L, 17L, 
17L, 17L, 17L, 17L, 50L, 50L, 50L, 50L, 18L, 18L, 16L, 50L, 50L, 
50L, 16L, 17L, 50L, 50L, 50L, 16L, 16L, 30L, 50L, 50L, 16L, 18L, 
17L, 50L, 50L, 50L, 50L, 50L, 50L, 21L, 30L, 21L, 18L, 21L, 21L, 
13L, 30L, 50L, 50L, 50L, 50L, 13L, 34L, 33L, 18L, 16L, 16L, 16L, 
16L, 18L, 10L, 34L, 37L, 34L, 34L, 18L, 33L, 33L, 18L, 18L, 37L, 
50L, 30L, 30L, 50L, 50L, 50L, 50L, 50L, 50L, 34L, 34L, 33L, 17L, 
14L, 19L, 33L, 18L, 18L, 18L, 50L, 30L, 30L, 30L, 34L, 18L, 18L, 
18L, 18L, 30L, 30L, 17L, 17L, 33L), .Label = c("", "C01-2", "C03-6", 
"C09-10", "C11", "C15", "C16", "C18-20", "C21", "C22", "C25", 
"C30-31", "C33-34", "C37-39", "C40-41", "C43", "C44", "C45", 
"C47/49", "C48", "C50", "C51", "C53", "C54-55", "C56", "C57-58", 
"C60", "C61", "C62", "C64", "C65-66/68", "C67", "C69", "C70", 
"C71", "C72", "C73", "C74-75", "C76.0", "C76.2", "C76.3", "C80", 
"C81", "C82-86", "C90.0", "C91.0", "C94.3/95", "D04", "D05", 
"D22", "D31", "D33", "D35"), class = "factor"), Immunohistochemistry = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 3L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 2L, 2L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 
2L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 2L, 4L, 2L, 4L, 4L, 4L, 4L, 3L, 
3L, 4L), .Label = c("", "N/A", "No", "Yes"), class = "factor")), row.names = c(NA, 
-115L), class = "data.frame")

And I would like an output that would look like

ICD_Grouping Yes No N/A
C22           2   1   0
C45           7   3   1
C69           4   0   0

That example uses made-up numbers, not this data. I would just like a data frame with the counts of each factor level in Immunohistochemistry by ICD_Grouping.

If I understand correctly, we can do that with base R's table(), since the dbdata shown contains just these two columns:

table(dbdata)

table() counts every factor level, including levels that no longer occur in the data, so to keep the result a reasonable size we first remove the unused levels with droplevels():

table(droplevels(dbdata))

            Immunohistochemistry
ICD_Grouping N/A No Yes
      C22      0  0   1
      C33-34   0  0   2
      C37-39   1  0   0
      C43      0  2   7
      C44      1  2   8
      C45      2  0  17
      C47/49   1  0   0
      C50      0  1   4
      C64      0  0  10
      C69      7  0   2
      C70      1  0   6
      C73      0  1   1
      D22      8  0  30
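To see the effect of droplevels() in isolation, here is a minimal sketch on a toy factor. The vector f and its values are invented for illustration and are not part of dbdata; only the level set mirrors Immunohistochemistry.

```r
# Toy data (made up for illustration): a factor that still carries
# levels with no remaining observations, e.g. after subsetting.
f <- factor(c("Yes", "No", "Yes"), levels = c("", "N/A", "No", "Yes"))

# table() reports every level, including the unused ones (as zero counts)
counts_all <- table(f)
print(counts_all)

# droplevels() discards unused levels, so only observed levels are counted
counts_used <- table(droplevels(f))
print(counts_used)
```

The same thing happens per column when droplevels() is applied to a whole data frame, which is why the cross-tabulation above only lists the groupings that actually occur.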

A table can be converted to a data.frame with the same structure using:

table(droplevels(dbdata)) %>%
    as.data.frame.matrix() %>%
    tibble::rownames_to_column('ICD_Grouping')

or, if you prefer a single pipeline from start to finish:

dbdata %>%
    droplevels() %>%
    table() %>%
    as.data.frame.matrix() %>%
    tibble::rownames_to_column('ICD_Grouping')

Both give the same data.frame as a result:

   ICD_Grouping N/A No Yes
1           C22   0  0   1
2        C33-34   0  0   2
3        C37-39   1  0   0
4           C43   0  2   7
5           C44   1  2   8
6           C45   2  0  17
7        C47/49   1  0   0
8           C50   0  1   4
9           C64   0  0  10
10          C69   7  0   2
11          C70   1  0   6
12          C73   0  1   1
13          D22   8  0  30

This form is a proper data frame that can be used in any downstream processing, or joined on the ICD_Grouping variable.
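Since the question asked for a dplyr approach, one possible alternative is to count the pairs and then widen the result with tidyr::pivot_wider(), the current replacement for spread(). This is only a sketch: the toy data frame below is an invented stand-in for dbdata, and the final select() assumes that both "Yes" and "No" actually occur in the data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for dbdata (values invented for illustration only)
toy <- data.frame(
  ICD_Grouping = factor(c("C22", "C22", "C45", "C45", "C45", "C69")),
  Immunohistochemistry = factor(
    c("Yes", "No", "Yes", "Yes", "N/A", "Yes"),
    levels = c("", "N/A", "No", "Yes")
  )
)

wide <- toy %>%
  # one row per observed (grouping, level) pair, with its count in column n
  count(ICD_Grouping, Immunohistochemistry) %>%
  # one column per level; pairs that never occur are filled with 0
  pivot_wider(
    names_from  = Immunohistochemistry,
    values_from = n,
    values_fill = 0
  ) %>%
  # keep only the two levels of interest
  select(ICD_Grouping, Yes, No)

print(wide)
```

Compared with table(), this version ignores any other columns in the data frame, so it does not depend on dbdata having been reduced to two columns first.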
