Count the number of each factor level, grouped by another factor
I know the answer to this question will be simple, but I have searched the forums extensively and have been unable to find a solution.
I have a column called Data_source, a factor that I want to group my variables by. I have a series of symptom* variables for which I want the counts according to Data_source.
For some reason, I am unable to figure out how to do this. The usual group_by functions do not seem to work appropriately.
Here is the data frame in question:
df <- wrapr::build_frame(
"Data_source" , "Sex" , "symptoms_decLOC", "symptoms_nausea_vomitting" |
"1" , "Female", NA_character_ , NA_character_ |
"1" , "Female", NA_character_ , NA_character_ |
"1" , "Female", "No" , NA_character_ |
"1" , "Female", "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"1" , "Male" , "Yes" , "No" |
"1" , "Female", "Yes" , "No" |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", "Yes" , "No" |
"2" , "Female", "Yes" , "No" |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ |
"2" , "Male" , NA_character_ , NA_character_ |
"2" , "Female", NA_character_ , NA_character_ )
Notice that Sex and the symptoms variables are all factors that include NAs. I have attempted the following:
df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
This does not work, and it is less than optimal because I would have to repeat it for every column. The ideal would be to use something similar to lapply(df, count), but this does not give me a description for each group.
EDIT
In response to the question below, I have added the expected output. I edited this in Excel, color coding the group_by for clarity.
Notice how I am getting a breakdown for each possible answer. When I run this using dplyr, here is the output:
> df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
# A tibble: 2 x 3
# Groups: Data_source [2]
Data_source `"symptoms_decLOC"` n
<chr> <chr> <int>
1 1 symptoms_decLOC 5
2 2 symptoms_decLOC 2
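A likely reason the attempt above returns one row per Data_source: count("symptoms_decLOC") receives a quoted string, so dplyr counts a constant rather than the column (the output column holds the literal text symptoms_decLOC). In addition, na.omit() drops every row that has an NA in any column. A minimal sketch, assuming a cut-down copy of df with only two of its columns, passes the bare column name and removes NAs in that column only:

```r
library(dplyr)

# Cut-down copy of the question's data (only two columns, for illustration)
df <- data.frame(
  Data_source = rep(c("1", "2"), times = c(8, 12)),
  symptoms_decLOC = c(NA, NA, "No", rep("Yes", 5),
                      NA, NA, NA, "Yes", "Yes", rep(NA, 7)),
  stringsAsFactors = FALSE
)

# Bare (unquoted) column name: count the values within each Data_source
res <- df %>%
  filter(!is.na(symptoms_decLOC)) %>%
  count(Data_source, symptoms_decLOC)

print(res)
```

This still has to be repeated per column, which is what the reshaping approaches in the answers avoid.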
This gets most of the way there. I haven't figured out how to include zero-count groups yet; supposedly adding .drop=FALSE takes care of this, but it's not working for me (using dplyr v. 0.8.0.9001).
library(dplyr)
library(tidyr)
(df
%>% tidyr::gather(var,val,-Data_source)
%>% count(Data_source,var,val, .drop=FALSE)
%>% na.omit()
)
Results:
Data_source var val n
<chr> <chr> <chr> <int>
1 1 Sex Female 7
2 1 Sex Male 1
3 1 symptoms_decLOC No 1
4 1 symptoms_decLOC Yes 5
5 1 symptoms_nausea_vomitting No 5
6 2 Sex Female 6
7 2 Sex Male 6
8 2 symptoms_decLOC Yes 2
9 2 symptoms_nausea_vomitting No 2
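One possible way to get the zero-count rows is tidyr::complete(), which adds the missing combinations after counting. The sketch below is not part of the answer above; it assumes only the symptom columns are of interest and that their possible values are "Yes" and "No" (that set is an assumption taken from the data shown), and it uses a cut-down copy of df.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the question's data (symptom columns only)
df <- data.frame(
  Data_source = rep(c("1", "2"), times = c(8, 12)),
  symptoms_decLOC = c(NA, NA, "No", rep("Yes", 5),
                      NA, NA, NA, "Yes", "Yes", rep(NA, 7)),
  symptoms_nausea_vomitting = c(NA, NA, NA, rep("No", 5),
                                NA, NA, NA, "No", "No", rep(NA, 7)),
  stringsAsFactors = FALSE
)

symptom_counts <- df %>%
  # wide to long: one row per (Data_source, var, val)
  pivot_longer(starts_with("symptoms"), names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%
  count(Data_source, var, val) %>%
  # add every Data_source x var x {Yes, No} combination; absent ones get n = 0
  complete(Data_source, var, val = c("Yes", "No"), fill = list(n = 0L))

print(symptom_counts)
```

Counting first and completing afterwards keeps the NA rows out of the totals while still listing every answer for every group.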
Using @Ben Bolker's answer to get counts for each group, then using spread and gather to include zero-count groups.
dplyr
library(dplyr)
library(tidyr)
# Count number of occurrences by Data_source
df2 <-
df %>%
gather(variable, value, -Data_source) %>%
count(Data_source, variable, value, name = "counter") %>%
na.omit()
# For variable = "Sex", leave as is
# For everything else (here the symptom* variables), convert value into a factor so zero-count levels are included
# Then spread to wide with missing combinations filled with 0, and gather back to long to bind rows
bind_rows(df2 %>%
filter(variable == "Sex"),
df2 %>%
filter(variable != "Sex") %>%
mutate(value = factor(value, levels = c("Yes", "No"))) %>%
spread(key = value, value = counter, fill = 0) %>%
gather(value, counter, -Data_source, -variable)) %>%
arrange(Data_source, variable)
data.table
library(data.table)
dt <- data.table(df)
# Melt data by Data source
dt_melt <- melt(dt, id.vars = "Data_source", value.factor = FALSE, variable.factor = FALSE)
# Add counter, if NA then 0 else 1
dt_melt[, counter := 0]
dt_melt[!is.na(value), counter := 1]
# Sum number of occurrences
dt_count <- dt_melt[,list(counter = sum(counter)), by = c("Data_source", "variable", "value")]
# Split into two dt
dt2a <- dt_count[variable == "Sex", ]
dt2b <- dt_count[variable != "Sex" ,]
# only on symptoms variables
# Convert into factor variable
dt2b$value <- factor(dt2b$value, levels = c("Yes", "No"))
dt2b_dcast <- dcast(data = dt2b, formula = Data_source + variable ~ value, value.var = "counter", fill = 0, drop = FALSE)
dt2b_melt <- melt(dt2b_dcast, id.vars = c("Data_source", "variable"), variable.name = "value", value.name = "counter")
# combine
combined_d <- rbind(dt2a, dt2b_melt)
combined_d[order(Data_source, variable), ]
I don't quite understand what you're asking, but I'll assume you want to count the number of non-NA values in each of your symptom_* columns.
This is a data.table solution:
# load library
library(data.table)
# Suppose the table is called "dt". Convert it to a data.table:
setDT(dt)
# convert the wide table to a long one, filter the values that
# aren't NA and count both, by Data_source and by variable
# (variable is the created column with the symptom_* names)
melt(dt, id.vars = 1:2)[!is.na(value),
.N,
by = .(Data_source, variable)]
What each part of the code is doing:
melt(dt, id.vars = 1:2) converts dt from wide to long, and keeps columns 1 and 2 (Data_source and Sex) as fixed.
!is.na(value) keeps the values (previously under each symptom_* header) that are not NA.
.N counts the rows.
by = .(Data_source, variable) is the grouping we are using to count. variable is the name of the column where the symptom_* names landed during the reshaping.
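If a breakdown per answer is wanted rather than a single non-NA count, one possible extension is to add value to the grouping. This is a sketch on a self-contained copy of the question's data, not part of the answer above:

```r
library(data.table)

# Self-contained copy of the question's data
dt <- data.table(
  Data_source = rep(c("1", "2"), times = c(8, 12)),
  Sex = c(rep("Female", 6), "Male", "Female",
          "Female", "Male", "Male", "Female", "Female", "Male",
          "Male", "Male", "Female", "Female", "Male", "Female"),
  symptoms_decLOC = c(NA, NA, "No", rep("Yes", 5),
                      NA, NA, NA, "Yes", "Yes", rep(NA, 7)),
  symptoms_nausea_vomitting = c(NA, NA, NA, rep("No", 5),
                                NA, NA, NA, "No", "No", rep(NA, 7))
)

# Wide to long, keeping Data_source and Sex as id columns
long <- melt(dt, id.vars = c("Data_source", "Sex"))

# Count rows per Data_source, symptom column and answer, ignoring NAs
res <- long[!is.na(value), .N, by = .(Data_source, variable, value)]

print(res)
```

Combinations that never occur (for example "No" for symptoms_decLOC in Data_source 2) are still absent from this result.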
Definitely, the hard part is keeping combinations that don't exist in the data. Here is a solution in two steps:
1. Prepare a data frame of all combinations, without counts
You can do this however you like, but I've chosen to compute two chunks since the categories are different for the variable Sex. No need to bind those chunks here.
chunk1 <- expand.grid(
Data_source = c("1", "2"),
name = c("symptoms_decLOC", "symptoms_nausea_vomitting"),
value = c("Yes", "No"),
stringsAsFactors = FALSE
)
chunk2 <- expand.grid(
Data_source = c("1", "2"),
name = "Sex",
value = c("Female", "Male"),
stringsAsFactors = FALSE
)
2. Finish the job
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = c("Sex", "symptoms_decLOC", "symptoms_nausea_vomitting"))%>%
group_by(Data_source, name, value) %>%
summarise(count = n()) %>%
right_join(bind_rows(chunk1, chunk2), by = c("Data_source", "name", "value")) %>%
arrange(Data_source, name) %>%
mutate(count = zoo::na.fill(count, 0))
Et voilà:
# A tibble: 12 x 4
# Groups: Data_source, name [6]
Data_source name value count
<chr> <chr> <chr> <int>
1 1 Sex Female 7
2 1 Sex Male 1
3 1 symptoms_decLOC Yes 5
4 1 symptoms_decLOC No 1
5 1 symptoms_nausea_vomitting Yes 0
6 1 symptoms_nausea_vomitting No 5
7 2 Sex Female 6
8 2 Sex Male 6
9 2 symptoms_decLOC Yes 2
10 2 symptoms_decLOC No 0
11 2 symptoms_nausea_vomitting Yes 0
12 2 symptoms_nausea_vomitting No 2
It is not so short, but it uses simple functions. The process is similar to what one can do in Excel, i.e., prepare the structure and then fill in the counts.
I hope this helps ;-)
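If the zoo dependency is not wanted, the fill step can probably be done with dplyr::coalesce() instead. A minimal sketch of the same "grid, then join, then fill" idea, using a small hypothetical counts table (the names counts, grid and filled are made up for illustration) in place of the summarised data:

```r
library(dplyr)

# Hypothetical counts, standing in for the result of summarise(count = n())
counts <- data.frame(
  Data_source = c("1", "1", "2"),
  name = "symptoms_decLOC",
  value = c("Yes", "No", "Yes"),
  count = c(5L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Every combination that should appear in the final table
grid <- expand.grid(
  Data_source = c("1", "2"),
  name = "symptoms_decLOC",
  value = c("Yes", "No"),
  stringsAsFactors = FALSE
)

filled <- counts %>%
  # keep every row of the grid; combinations without a match get count = NA
  right_join(grid, by = c("Data_source", "name", "value")) %>%
  # replace the NA counts with 0
  mutate(count = coalesce(count, 0L)) %>%
  arrange(Data_source, name, value)

print(filled)
```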