
Grouping data for multiple variables by a single variable

In the following dataset, I want to do two things:

pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tobacco <- c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tobacco, Alcohol, PA)
mydata
  1. Count the number of rows per patient that are not NA for each variable (Tobacco, Alcohol and PA), grouped by patient ID.

I used the following code to get my output, but I can only do it for one variable at a time.

library(dplyr)

# Keep only rows with a recorded Tobacco value, then count rows per patient
mydata_tob <- mydata %>% 
  filter(!is.na(Tobacco)) %>% 
  group_by(pt_id) %>% 
  count()

# A tibble: 3 x 2
# Groups:   pt_id [3]
  pt_id     n
  <dbl> <int>
1     1     3
2     3     4
3     4     2

But this is very time-consuming because my original dataset has many variables. I want the same kind of output for all the variables in one go.

  2. As the end result, I want to calculate the percentage of pt_id values with more than 1 entry for each variable. I created the following function (only for Tobacco) to do so:
gt1_prop <- function(n) {
  # n: per-patient counts of non-NA rows, e.g. mydata_tob$n
  gt1_len <- sum(n > 1)   # patients with more than one entry
  len_tot <- length(n)    # patients with at least one entry
  (gt1_len / len_tot) * 100
}

Again, I want to write this so that I get the percentage for each variable (Tobacco, Alcohol and PA) in the dataset.

Any suggestions will be helpful. Thanks in advance!

To count the number of non-NA values for each pt_id, you can use across().

library(dplyr)

mydata %>%
  group_by(pt_id) %>%
  summarise(across(Tobacco:PA, ~sum(!is.na(.)))) -> result
result

#  pt_id Tobacco Alcohol    PA
#  <dbl>   <int>   <int> <int>
#1     1       3       3     2
#2     2       0       1     0
#3     3       4       3     3
#4     4       2       2     1
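
Since the original dataset is said to have many more variables, the column range does not have to be spelled out. The sketch below assumes dplyr 1.0.0 or later and relies on the fact that across() leaves grouping columns out of its selection, so everything() covers every column except pt_id. The name counts_all is only illustrative.

```r
library(dplyr)

# Example data from the question
mydata <- data.frame(
  pt_id   = c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4),
  Tobacco = c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never"),
  Alcohol = c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never"),
  PA      = c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
)

# everything() inside across() skips the grouping column pt_id,
# so every remaining column is counted without listing it by name
counts_all <- mydata %>%
  group_by(pt_id) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))

counts_all
```

If only some columns should be counted, a selection such as across(-c(pt_id, some_other_column), ...) can be used instead, where some_other_column stands for a hypothetical column to skip.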

For the second step, to calculate the percentage, you can do:

result %>%
  summarise(across(Tobacco:PA, ~mean(. > 1) * 100))

#  Tobacco Alcohol    PA
#    <dbl>   <dbl> <dbl>
#1      75      75    50
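
One detail worth checking is the denominator. The summary above divides by all four patients, whereas the filter-based approach in the question only keeps patients with at least one non-NA entry for that variable. The sketch below computes both variants side by side, assuming dplyr 1.0.0 or later; the names pct_all and pct_nonmissing are illustrative, not part of the original answer.

```r
library(dplyr)

# Example data from the question
mydata <- data.frame(
  pt_id   = c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4),
  Tobacco = c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never"),
  Alcohol = c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never"),
  PA      = c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
)

# Step 1: non-NA rows per patient for every variable
counts <- mydata %>%
  group_by(pt_id) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))

# Step 2a: percentage of ALL patients with more than one entry
pct_all <- counts %>%
  summarise(across(-pt_id, ~ mean(.x > 1) * 100))

# Step 2b: percentage among patients with at least one entry,
# which mirrors the filter(!is.na(...)) approach in the question
pct_nonmissing <- counts %>%
  summarise(across(-pt_id, ~ mean(.x[.x > 0] > 1) * 100))

pct_all         # Tobacco 75, Alcohol 75, PA 50
pct_nonmissing  # Tobacco 100, Alcohol 75, PA 66.7 (2 of 3)
```

Which denominator is appropriate depends on whether patients with no recorded value for a variable should count toward the total.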

In base R, we can use aggregate(). Setting na.action = NULL keeps rows that contain NA so they can be counted:

aggregate(.~ pt_id, mydata, FUN = function(x) sum(!is.na(x)), na.action = NULL)

Output:

#   pt_id Tobacco Alcohol PA
#1     1       3       3  2
#2     2       0       1  0
#3     3       4       3  3
#4     4       2       2  1

Or, more compactly, with rowsum from base R:

rowsum(+(!is.na(mydata[-1])), mydata$pt_id)
#  Tobacco Alcohol PA
#1       3       3  2
#2       0       1  0
#3       4       3  3
#4       2       2  1

If we need the proportion of patients with more than one entry (multiply by 100 for a percentage):

colMeans(rowsum(+(!is.na(mydata[-1])), mydata$pt_id) > 1)
#Tobacco Alcohol      PA 
#   0.75    0.75    0.50 
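
The base R steps above can be wrapped into a small reusable function that returns percentages for every non-ID column. This is only a sketch: pct_gt1 and its arguments data and id_col are hypothetical names introduced here, and the denominator is all patients, as in the colMeans call above.

```r
# Example data from the question
mydata <- data.frame(
  pt_id   = c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4),
  Tobacco = c("once","twice","never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never"),
  Alcohol = c("twice", "once",NA, NA, "never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never"),
  PA      = c("once",NA,"never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
)

# Hypothetical helper: percentage of IDs with more than one non-NA
# entry, computed for every column except the ID column
pct_gt1 <- function(data, id_col) {
  value_cols <- setdiff(names(data), id_col)
  # 0/1 matrix of non-missing flags, summed within each ID
  counts <- rowsum(+(!is.na(data[value_cols])), data[[id_col]])
  100 * colMeans(counts > 1)
}

pct_gt1(mydata, "pt_id")
```

For the example data this gives 75 for Tobacco, 75 for Alcohol and 50 for PA.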
