Count the number of each factor grouped by another factor

Question

I know the answer to this question will be simple, but I have searched the forums extensively and have been unable to find a solution.

I have a column called Data_source, which is a factor that I want to group my variables by.

I have a series of symptom* variables for which I want the counts according to Data_source.

For some reason, I am unable to figure out how to do this. The normal group_by functions do not seem to work appropriately.

Here is the data frame in question:

 df <- wrapr::build_frame(
   "Data_source"  , "Sex"   , "symptoms_decLOC", "symptoms_nausea_vomitting" |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", "No"             , NA_character_               |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Male"  , "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               )

Notice that Sex and the symptoms variables are all factors which include NAs. I have attempted the following:

df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")

This does not work, and it is less than optimal because I would have to repeat it for every column. The ideal would be to use something similar to lapply(df, count), but this does not give me a description for each group.

EDIT

In response to the question below, I have added the expected output. I edited this in Excel, color coding the group_by for clarity.

Notice how I am getting a breakdown for each possible answer. When I run this using dplyr, here is the output.

> df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
# A tibble: 2 x 3
# Groups:   Data_source [2]
  Data_source `"symptoms_decLOC"`     n
  <chr>       <chr>               <int>
1 1           symptoms_decLOC         5
2 2           symptoms_decLOC         2
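A likely reading of the output above (a sketch, not a confirmed diagnosis): because the column name is passed as a quoted string, count() treats it as a constant value rather than as a column, so it only counts the rows in each Data_source group. The toy data frame below is an assumption made up for illustration; it is not the data from the question.

```r
library(dplyr)

# Hypothetical toy data, much smaller than the real data frame
toy <- data.frame(
  Data_source     = c("1", "1", "1", "2"),
  symptoms_decLOC = c("Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Quoted name: the string is a constant, so this is just a row count per group
quoted <- toy %>% group_by(Data_source) %>% count("symptoms_decLOC")

# Unquoted name: counts each value of the column within each group
unquoted <- toy %>% group_by(Data_source) %>% count(symptoms_decLOC)

print(quoted)
print(unquoted)
```

With the unquoted name there is one row per Data_source and answer, which is closer to the breakdown asked for, although it still has to be repeated for each column.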

Answer 1

This gets most of the way there: I haven't figured out how to include zero-count groups yet... supposedly adding .drop=FALSE takes care of this, but it's not working for me (using dplyr v. 0.8.0.9001).

library(dplyr)
library(tidyr)
(df
    %>% tidyr::gather(var,val,-Data_source)
    %>% count(Data_source,var,val, .drop=FALSE)
    %>% na.omit()
)

Results:

  Data_source var                       val        n
  <chr>       <chr>                     <chr>  <int>
1 1           Sex                       Female     7
2 1           Sex                       Male       1
3 1           symptoms_decLOC           No         1
4 1           symptoms_decLOC           Yes        5
5 1           symptoms_nausea_vomitting No         5
6 2           Sex                       Female     6
7 2           Sex                       Male       6
8 2           symptoms_decLOC           Yes        2
9 2           symptoms_nausea_vomitting No         2
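One possible way to add the zero-count rows, which is not part of the answer above, is tidyr::complete(). As far as I know, .drop=FALSE only keeps unused levels of factor columns, and the gathered val column is character, which may be why it has no effect here. The sketch below uses a made-up toy data frame with a single symptom column; the names toy and counts are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one symptom column (not the question's data)
toy <- data.frame(
  Data_source     = c("1", "1", "1", "2", "2"),
  symptoms_decLOC = c("Yes", "No", NA, "Yes", NA),
  stringsAsFactors = FALSE
)

counts <- toy %>%
  pivot_longer(-Data_source, names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%
  count(Data_source, var, val) %>%
  # Add every Data_source / var / answer combination, with n = 0 where absent
  complete(Data_source, var, val = c("Yes", "No"), fill = list(n = 0L))

print(counts)
```

If Sex were included in the long data, the same call would also create Sex/Yes and Sex/No rows, so the symptom columns would need to be completed separately from Sex.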

Answer 2

Using @Ben Bolker's answer to get counts for each group, then using spread and gather to include zero-count groups.

dplyr

library(dplyr)
library(tidyr)

# Count the number of occurrences by Data_source, variable and value
df2 <- 
  df %>% 
  gather(variable, value, -Data_source) %>% 
  count(Data_source, variable, value, name = "counter") %>%
  na.omit() 

# For variable == "Sex", leave the counts as they are
# For everything else (here the symptom* variables), convert value into a factor so both levels are present
# Then spread to wide format with missing cells filled with 0, and gather back to long format to bind the rows
bind_rows(df2 %>%
            filter(variable == "Sex"), 

          df2 %>%
            filter(variable != "Sex") %>%
            mutate(value = factor(value, levels = c("Yes", "No"))) %>%
            spread(key = value, value = counter, fill = 0) %>%
            gather(value, counter, -Data_source, -variable))  %>%

  arrange(Data_source, variable)

data.table

library(data.table)
dt <- data.table(df)

# Melt data by Data source
dt_melt <- melt(dt, id.vars = "Data_source", value.factor = FALSE, variable.factor = FALSE)

# Add counter, if NA then 0 else 1
dt_melt[, counter := 0]
dt_melt[!is.na(value), counter := 1]

# Sum number of occurrences
dt_count <- dt_melt[,list(counter = sum(counter)), by = c("Data_source", "variable", "value")]

# Split into two dt
dt2a <- dt_count[variable == "Sex", ]
dt2b <- dt_count[variable != "Sex" ,]

# only on symptoms variables
# Convert into factor variable
dt2b$value <- factor(dt2b$value, levels = c("Yes", "No"))
dt2b_dcast <- dcast(data = dt2b, formula = Data_source + variable ~ value, value.var = "counter", fill = 0, drop = FALSE)
dt2b_melt <- melt(dt2b_dcast, id.vars = c("Data_source", "variable"), variable.name = "value", value.name = "counter") 

# combine
combined_d <- rbind(dt2a, dt2b_melt)
combined_d[order(Data_source, variable), ]

Answer 3

I don't quite understand what you're asking, but I'll assume you want to count the number of non-NA values in each of your symptom_* columns.

This is a data.table solution:

# load library

library(data.table)

# Suppose the table is called "dt". Convert it to a data.table:

setDT(dt)

# convert the wide table to a long one, filter the values that
# aren't NA and count both, by Data_source and by variable
# (variable is the created column with the symptom_* names)

melt(dt, id.vars = 1:2)[!is.na(value), 
                        .N, 
                         by = .(Data_source, variable)]

What each part of the code is doing:

melt(dt, id.vars = 1:2) converts dt from wide to long, and keeps columns 1 and 2 (Data_source and Sex) fixed.

!is.na(value) keeps only the values (previously under each symptom_* header) that are not NA.

.N counts the rows.

by = .(Data_source, variable) is the grouping we are using to count. variable is the name of the column where the symptom_* names landed during the reshaping.
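The steps above can be tried on a small table. This is only a sketch: the toy table and the names long, per_variable and per_value are made up for illustration. The last line is a possible extension, not part of the answer as written, that adds value to the grouping so each answer (Yes/No) gets its own count.

```r
library(data.table)

# Hypothetical toy table with the same column layout as the question
toy <- data.table(
  Data_source               = c("1", "1", "2"),
  Sex                       = c("Female", "Male", "Female"),
  symptoms_decLOC           = c("Yes", "No", "Yes"),
  symptoms_nausea_vomitting = c("No", NA, NA)
)

# Wide to long, keeping columns 1 and 2 (Data_source and Sex) fixed
long <- melt(toy, id.vars = 1:2)

# Non-NA values per Data_source and symptom column
per_variable <- long[!is.na(value), .N, by = .(Data_source, variable)]

# Extension: also split by the answer itself
per_value <- long[!is.na(value), .N, by = .(Data_source, variable, value)]

print(per_variable)
print(per_value)
```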

Answer 4

Definitely, the hard part is keeping combinations that don't exist in the data... Here is a solution in two steps:

1. Prepare a data frame without counts

You can do this however you like, but I've chosen to build two chunks since the possible values are different for the variable Sex. No need to bind those chunks here.

chunk1 <- expand.grid(
  Data_source = c("1", "2"),
  name = c("symptoms_decLOC", "symptoms_nausea_vomitting"),
  value = c("Yes", "No"),
  stringsAsFactors = FALSE
)

chunk2 <- expand.grid(
  Data_source = c("1", "2"),
  name = "Sex",
  value = c("Female", "Male"),
  stringsAsFactors = FALSE
)

2. Finish the job asked

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = c("Sex", "symptoms_decLOC", "symptoms_nausea_vomitting"))%>%
  group_by(Data_source, name, value) %>%
  summarise(count = n()) %>%
  right_join(bind_rows(chunk1, chunk2), by = c("Data_source", "name", "value")) %>%
  arrange(Data_source, name) %>%
  mutate(count = zoo::na.fill(count, 0))

Et voilà

# A tibble: 12 x 4
# Groups:   Data_source, name [6]
   Data_source name                      value  count
   <chr>       <chr>                     <chr>  <int>
 1 1           Sex                       Female     7
 2 1           Sex                       Male       1
 3 1           symptoms_decLOC           Yes        5
 4 1           symptoms_decLOC           No         1
 5 1           symptoms_nausea_vomitting Yes        0
 6 1           symptoms_nausea_vomitting No         5
 7 2           Sex                       Female     6
 8 2           Sex                       Male       6
 9 2           symptoms_decLOC           Yes        2
10 2           symptoms_decLOC           No         0
11 2           symptoms_nausea_vomitting Yes        0
12 2           symptoms_nausea_vomitting No         2

It is not so short, but it uses simple functions. The process is similar to what one can do in Excel, i.e., prepare the structure and then fill in the counts.
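The same "prepare the structure, then fill in the counts" idea can be sketched without the zoo dependency by starting from the scaffold and using dplyr::coalesce() for the zeros. This is an alternative under stated assumptions, not the code from this answer: the toy data frame and the names scaffold, observed and result are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with one symptom column
toy <- data.frame(
  Data_source     = c("1", "1", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Step 1: every combination we want to see in the result
scaffold <- expand.grid(
  Data_source = c("1", "2"),
  value       = c("Yes", "No"),
  stringsAsFactors = FALSE
)

# Step 2: observed counts, joined onto the scaffold; absent combinations become 0
observed <- toy %>% count(Data_source, value = symptoms_decLOC, name = "count")

result <- scaffold %>%
  left_join(observed, by = c("Data_source", "value")) %>%
  mutate(count = coalesce(count, 0L)) %>%
  arrange(Data_source, value)

print(result)
```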

I hope it helps ;-)

4 solutions

Solution 1
2 2019-04-23 22:33:22

Solution 2
1 2019-04-24 02:08:22

Solution 3
0 2019-04-17 04:26:56

Solution 4
0 2020-10-28 10:26:38
