Grouping data for multiple variables by a single variable
In the following dataset, I want to do two things:
pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tobacco, Alcohol, PA)
mydata
I used the following code to get my output, but I can only do it for one variable at a time.
library(dplyr)

# Count the non-missing Tobacco records for each patient
mydata_tob <- mydata %>%
  filter(!is.na(Tobacco)) %>%
  group_by(pt_id) %>%
  count()
mydata_tob
# A tibble: 3 x 2
# Groups:   pt_id [3]
  pt_id     n
  <dbl> <int>
1     1     3
2     3     4
3     4     2
But this is very time-consuming, as I have many variables in my original dataset. I want a similar output for all the variables in one go.
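gt1_prop <- function(n) {
  # Use the argument n rather than the global mydata_tob$n
  gt1_len <- length(n[n > 1])  # patients with more than one record
  len_tot <- length(n)         # all patients that have a count
  gt1_prop <- (gt1_len / len_tot) * 100
  return(gt1_prop)
}
gt1_prop(mydata_tob$n)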
Again, I want to write this so that I get the proportion for each variable (Tobacco, Alcohol and PA) in the dataset.
Any suggestions will be helpful. Thanks in advance!
To count the number of non-NA values for each pt_id, you can use across().
library(dplyr)

result <- mydata %>%
  group_by(pt_id) %>%
  summarise(across(Tobacco:PA, ~ sum(!is.na(.))))
result
# pt_id Tobacco Alcohol PA
# <dbl> <int> <int> <int>
#1 1 3 3 2
#2 2 0 1 0
#3 3 4 3 3
#4 4 2 2 1
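Since the question mentions many more variables than these three, here is a minimal sketch of the same count without naming a column range. It relies on the fact that, inside summarise() on grouped data, across(everything()) leaves out the grouping column. The name result_all is only illustrative, and the data is copied from the question so the block runs on its own.

```r
library(dplyr)

# Example data copied from the question
pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tobacco, Alcohol, PA)

# everything() skips the grouping column pt_id, so every remaining
# column is counted without having to be listed by name.
result_all <- mydata %>%
  group_by(pt_id) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))

result_all
```

If only some columns should be counted, a selection such as across(c(Tobacco, PA), ...) can be used in place of everything().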
For the second step, to calculate the percentage of patients with more than one record, you can do:
result %>%
  summarise(across(Tobacco:PA, ~ mean(. > 1) * 100))
#  Tobacco Alcohol    PA
#    <dbl>   <dbl> <dbl>
#1      75      75    50
In base R, we can do:
aggregate(.~ pt_id, mydata, FUN = function(x) sum(!is.na(x)), na.action = NULL)
-output
# pt_id Tobacco Alcohol PA
#1 1 3 3 2
#2 2 0 1 0
#3 3 4 3 3
#4 4 2 2 1
Or more compactly with rowsum from base R:
rowsum(+(!is.na(mydata[-1])), mydata$pt_id)
# Tobacco Alcohol PA
#1 3 3 2
#2 0 1 0
#3 4 3 3
#4 2 2 1
If we need the proportions (multiply by 100 to get percentages):
colMeans(rowsum(+(!is.na(mydata[-1])), mydata$pt_id) > 1)
#Tobacco Alcohol PA
# 0.75 0.75 0.50
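One caveat: the code in the question filters out NA rows before grouping, so a patient with no records for a variable drops out of the denominator, whereas the summaries above divide by all four patients. A rough sketch of the question's denominator in base R, assuming that is the intended definition (the names counts and pct_gt1 are illustrative):

```r
# Example data copied from the question
pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tobacco, Alcohol, PA)

# Non-NA counts per patient (rows) and variable (columns)
counts <- rowsum(+(!is.na(mydata[-1])), mydata$pt_id)

# For each variable, keep only patients with at least one record,
# then take the percentage of those with more than one record.
pct_gt1 <- apply(counts, 2, function(n) mean(n[n > 0] > 1) * 100)
pct_gt1
```

For Tobacco this gives 100 rather than 75, because patient 2 has no tobacco records and is left out, which matches what gt1_prop computes from mydata_tob in the question.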