Summary statistics for a numeric variable by 2 factor variables (what are these SAS commands in R?)

Question

I am new to R; I am used to SAS. I have a dataset with many variables, three of which are age, sex, and agegroup. I am trying to generate summary statistics (mean, median, Q1-Q3, sd) of the variable age within each combination of sex and agegroup, i.e. the summary statistics for age in females (sex=0) in agegroup 1, then agegroup 2, etc., and the same for males (sex=1).

In SAS, I would use:

proc univariate data=mydata;  
var age;  
class agegroup;  
class sex;  
run;

What would this be in R?

Also, what is the R equivalent of SAS's npar1way? For example:

proc npar1way data=mydata;  
where minutes ne 9;  
var minutes;  
class sex;  
run;

Here minutes is restricted to values not equal to 9 because 9 codes missing values. How do I do this in R?

Answer 1

# In R, missing values are denoted by "NA" instead of the number 9.
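As a small illustration of that point, the sketch below recodes a sentinel value of 9 to NA so that R's functions can skip it. The data frame `mydata` and its values are made up for the example; only the column names `minutes` and `sex` come from the question.

```r
# Hypothetical data: 9 is the code used for "missing" in the question
mydata <- data.frame(
  minutes = c(3, 9, 5, 7, 9, 4),
  sex     = c(0, 1, 0, 1, 0, 1)
)

# Recode the sentinel value 9 to NA
mydata$minutes[mydata$minutes == 9] <- NA

# Count the missing values, then summarise while ignoring them
n_missing    <- sum(is.na(mydata$minutes))
mean_minutes <- mean(mydata$minutes, na.rm = TRUE)
```

Most summary functions (`mean`, `median`, `sd`, `quantile`) return NA or stop with an error when the input contains NA unless `na.rm = TRUE` is passed.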

# save this data in a text file 
age agegroup sex
1 agegroup1 male
2 agegroup2 female
3 agegroup3 male
5 agegroup1 female
7 agegroup2 male
8 agegroup3 female
1 agegroup3 male
2 agegroup2 female
3 agegroup1 male

# Set the working directory to the location of the data file using setwd()
setwd("PATH OF THE DIRECTORY")

data <- read.table("data", header=TRUE, sep=" ")
data
data$sex <- factor(data$sex, levels = c('male', 'female'), ordered=TRUE)
data$agegroup <- factor(data$agegroup, levels = c('agegroup1', 'agegroup2', 'agegroup3'), ordered=TRUE)

# Know the structure of your data
str(data)

# Summary of the data
summary(data)

# Std. Dev. of the variable "age"
std.dev.age <- sd(data$age)
std.dev.age

# Summary of three variables in a table form
table(data)

# Plot a dodged bar chart with age ~ sex + agegroup
library("ggplot2")

ggplot(data = data, aes(x = sex, y = age, ymin = 0, ymax = 8, fill = agegroup)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.50) +
  scale_fill_manual(values = c("red", "green", "blue")) +
  labs(x = "", y = "age(years)", fill = " ")
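The script above does not cover the second part of the question, the PROC NPAR1WAY call. PROC NPAR1WAY reports several rank-based tests, among them the Wilcoxon rank-sum and Kruskal-Wallis tests, and base R offers `wilcox.test` and `kruskal.test` for those. The following is only a sketch with made-up values in a hypothetical `mydata`; it is not a line-by-line reproduction of the SAS output.

```r
# Hypothetical data; 9 marks a missing value, as in the question
mydata <- data.frame(
  minutes = c(12, 15, 9, 20, 31, 9, 18, 25, 14, 22),
  sex     = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Counterpart of "where minutes ne 9": drop the rows coded as missing
kept <- subset(mydata, minutes != 9)

# Wilcoxon rank-sum (Mann-Whitney) test comparing the two sexes
w <- wilcox.test(minutes ~ sex, data = kept)

# Kruskal-Wallis test, which also works for more than two groups
k <- kruskal.test(minutes ~ sex, data = kept)

w$p.value
k$p.value
```

If the 9s have already been recoded to NA, the `subset` step can probably be skipped, since the formula interface of both tests drops incomplete rows by default.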

Answer 2

You can use the aggregate function in R to split the data into subsets, compute summary statistics for each subset, and return the result in a convenient form.

> # create some data
> age <- runif(100, 20, 60)
> sex <- sample(c(0, 1), 100, replace = TRUE)
> agegroup <- sample(1:3, 100, replace = TRUE)

You can then compute the quantiles for subsets grouped by sex and agegroup as

> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="quantile")
  sex agegroup     x.0%    x.25%    x.50%    x.75%   x.100%
1   0        1 26.70523 31.75807 37.09244 46.49449 59.77582
2   1        1 20.68903 34.49182 45.66960 48.69480 54.90620
3   0        2 20.22123 33.22948 40.57074 47.32490 58.85273
4   1        2 23.50579 31.38165 35.69254 45.13376 50.68572
5   0        3 23.46469 29.72909 42.53047 46.93867 58.30279
6   1        3 20.64256 27.22600 39.70127 48.66251 59.61565

or compute the mean

> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="mean")
  sex agegroup        x
1   0        1 39.95470
2   1        1 41.53341
3   0        2 40.53606
4   1        2 37.32189
5   0        3 40.68784
6   1        3 38.74829

The same approach works for the standard deviation, the variance, or any other statistic you want to compute for each subset.
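To get all the statistics from the question in a single call, one option is to pass aggregate a function that returns a named vector. This is a sketch under assumptions: `age_stats` is a hypothetical helper, the data are simulated as above, and the formula interface of aggregate is used instead of `x=`/`by=`.

```r
set.seed(1)  # only so that the simulated data are reproducible
age      <- runif(100, 20, 60)
sex      <- sample(c(0, 1), 100, replace = TRUE)
agegroup <- sample(1:3, 100, replace = TRUE)
test     <- data.frame(age, sex, agegroup)

# Hypothetical helper returning all requested statistics at once
age_stats <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    q1     = unname(quantile(x, 0.25)),
    q3     = unname(quantile(x, 0.75)),
    sd     = sd(x))
}

res <- aggregate(age ~ sex + agegroup, data = test, FUN = age_stats)

# The statistics come back as one matrix column; flatten it into plain columns
res <- do.call(data.frame, res)
res
```

After flattening, the columns are named `age.mean`, `age.median`, and so on, with one row per sex/agegroup combination that occurs in the data.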

Answer 3

# make some test data
age <- runif(100, 20, 60)
sex <- sample(c(0, 1), 100, replace = TRUE)
agegroup <- sample(1:3, 100, replace = TRUE)
test <- data.frame(age,sex,agegroup)

# define a new summary function to include the SD as well
# otherwise you will just get mean,median,min,max,Q1-Q3.
newsummary <- function(x) {c(summary(x),SD=sd(x))}

# get the summary stats by each agegroup/sex combo
by(test$age,test[c("sex","agegroup")],newsummary)

The results look like this; the output is in list format.

> by(test$age,test[c("sex","agegroup")],newsummary)
sex: 0
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
22.07000 27.72000 38.36000 38.41000 48.02000 54.93000 11.50681 
------------------------------------------------------------ 
sex: 1
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
24.36000 38.20000 44.96000 44.55000 52.95000 58.03000 10.70105 
------------------------------------------------------------ 
sex: 0
agegroup: 2
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
21.52000 28.54000 36.75000 38.52000 49.45000 57.12000 12.26674 
------------------------------------------------------------ 
sex: 1
agegroup: 2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD 
20.0900 26.9900 31.7700 35.9800 44.6200 57.3500 11.9548 
------------------------------------------------------------ 
sex: 0
agegroup: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD 
20.5100 30.4300 39.6300 39.4100 47.4100 57.6000 11.9816 
------------------------------------------------------------ 
sex: 1
agegroup: 3
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
20.04000 25.01000 36.03000 37.58000 47.81000 59.65000 13.14822
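If a table is easier to work with than the list printed above, the result of by can be bound into a data frame. The sketch below assumes every sex/agegroup cell contains at least one observation (an empty cell would yield NULL and misalign the rows); `labels` and `tab` are names introduced here for illustration.

```r
set.seed(1)  # only so that the simulated data are reproducible
test <- data.frame(
  age      = runif(100, 20, 60),
  sex      = sample(c(0, 1), 100, replace = TRUE),
  agegroup = sample(1:3, 100, replace = TRUE)
)

newsummary <- function(x) c(summary(x), SD = sd(x))
res <- by(test$age, test[c("sex", "agegroup")], newsummary)

# One row of group labels per cell, in the same order as the cells of `res`
# (the first factor, sex, varies fastest in both)
labels <- expand.grid(dimnames(res))

# Stack the per-cell summaries and attach the labels
tab <- cbind(labels, do.call(rbind, res))
tab
```

Each row of `tab` then holds the sex, the agegroup, and the seven statistics for that combination.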

Answer 1
2 2012-09-23 15:14:42

Answer 2
2 2012-09-23 15:41:17

Answer 3
1 2012-09-23 22:42:44
