Summary statistics of numerical and 2 factor variables (what would these commands in SAS be in R?)

I am new to R; I am used to SAS. I have a dataset with many variables, three of which are age, sex, and agegroup. I am trying to generate summary statistics (mean, median, Q1-Q3, sd) of the variable age within each combination of sex and agegroup: that is, the summary statistics for age in females (sex=0) in agegroup 1, then in agegroup 2, and so on, and the same for males (sex=1).

In SAS, I would use:

proc univariate data=mydata;  
var age;  
class agegroup;  
class sex;  
run;

What would this be in R?

Also, what is the R equivalent of SAS's npar1way? For example:

proc npar1way data=mydata;  
where minutes ne 9;  
var minutes;  
class sex;  
run;

Here minutes is restricted to values not equal to 9 because 9 codes missing values. How do I do this in R?

# In R, missing values are denoted by "NA" instead of the number 9.
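
As a minimal sketch of the point above, the sentinel value 9 can be recoded to NA before testing. The data frame `mydata` and its values below are made up for illustration. `wilcox.test` and `kruskal.test` are base-R rank tests; treating them as the counterpart of PROC NPAR1WAY is an assumption about which of its tests are wanted.

```r
# Hypothetical data: 'minutes' uses 9 as a missing-value code, as in the question.
mydata <- data.frame(
  minutes = c(12, 9, 30, 25, 9, 18, 40, 22, 15, 33),
  sex     = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
)

# Recode the sentinel value 9 as NA so R treats it as missing.
mydata$minutes[mydata$minutes == 9] <- NA

# Label the grouping variable (0 = female, 1 = male, as in the question).
mydata$sex <- factor(mydata$sex, levels = c(0, 1), labels = c("female", "male"))

# Wilcoxon rank-sum test; under R's default settings rows with NA are dropped.
res <- wilcox.test(minutes ~ sex, data = mydata)
print(res)

# Kruskal-Wallis test, which also works with more than two groups.
kw <- kruskal.test(minutes ~ sex, data = mydata)
print(kw)
```

Recoding once up front plays the role of the SAS `where minutes ne 9;` filter for every later analysis step.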

# save this data in a text file 
age agegroup sex
1 agegroup1 male
2 agegroup2 female
3 agegroup3 male
5 agegroup1 female
7 agegroup2 male
8 agegroup3 female
1 agegroup3 male
2 agegroup2 female
3 agegroup1 male

# Set the working directory to the location of the data file using the function 
setwd("PATH OF THE DIRECTORY")

data <- read.table("data", header=TRUE, sep=" ")
data
data$sex <- factor(data$sex, levels = c('male', 'female'), ordered=TRUE)
data$agegroup <- factor(data$agegroup, levels = c('agegroup1', 'agegroup2', 'agegroup3'), ordered=TRUE)

# Know the structure of your data
str(data)

# Summary of the data
summary(data)

# Std. Dev. of the variable "age"
std.dev.age <- sd(data$age)
std.dev.age

# Summary of three variables in a table form
table(data)

# Plot a dodged bar chart with age ~ sex + agegroup
library("ggplot2")

ggplot(data = data, aes(x = sex, y = age, ymin = 0, ymax = 8, fill = agegroup)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.50) +
  scale_fill_manual(values = c("red", "green", "blue")) +
  labs(x = "", y = "age(years)", fill = " ")
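
The example table above can also be read without a file on disk and summarised per sex and agegroup. This is only a sketch: the `text =` argument of `read.table` replaces the file, and `tapply` is one of several base-R ways to get group-wise statistics; the variable names `raw`, `means` and `medians` are chosen here for illustration.

```r
# The same nine rows as in the example file, supplied inline.
raw <- "age agegroup sex
1 agegroup1 male
2 agegroup2 female
3 agegroup3 male
5 agegroup1 female
7 agegroup2 male
8 agegroup3 female
1 agegroup3 male
2 agegroup2 female
3 agegroup1 male"

data <- read.table(text = raw, header = TRUE)

# Mean and median of age for every sex x agegroup cell
# (rows are sex, columns are agegroup).
means   <- tapply(data$age, list(data$sex, data$agegroup), mean)
medians <- tapply(data$age, list(data$sex, data$agegroup), median)

print(means)
print(medians)
```

With so few rows some cells hold a single observation, so a statistic such as `sd` would return NA for them.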

You can use the aggregate function in R to split the data into subsets, compute summary statistics for each subset, and return the result in a convenient form.

# create some data
> age <- runif(100, 20, 60)
> sex <- sample(c(0, 1), 100, replace = TRUE)
> agegroup <- sample(1:3, 100, replace = TRUE)

You can then compute the quantiles for the subsets grouped by sex and agegroup:

> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="quantile")
  sex agegroup     x.0%    x.25%    x.50%    x.75%   x.100%
1   0        1 26.70523 31.75807 37.09244 46.49449 59.77582
2   1        1 20.68903 34.49182 45.66960 48.69480 54.90620
3   0        2 20.22123 33.22948 40.57074 47.32490 58.85273
4   1        2 23.50579 31.38165 35.69254 45.13376 50.68572
5   0        3 23.46469 29.72909 42.53047 46.93867 58.30279
6   1        3 20.64256 27.22600 39.70127 48.66251 59.61565

or compute the mean:

> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="mean")
  sex agegroup        x
1   0        1 39.95470
2   1        1 41.53341
3   0        2 40.53606
4   1        2 37.32189
5   0        3 40.68784
6   1        3 38.74829

The same approach works for the standard deviation, the variance, or any other statistic you want to compute for each subset.
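
Several statistics can also be requested in one call by passing a function that returns a named vector. The sketch below assumes simulated data laid out so that every sex/agegroup cell is populated; the helper name `five_stats` and the data frame `df` are made up for illustration.

```r
set.seed(42)  # reproducible simulated ages

# Simulated data; sex and agegroup cycle so all six combinations occur.
df <- data.frame(
  age      = runif(100, 20, 60),
  sex      = rep(c(0, 1), 50),
  agegroup = rep(1:3, length.out = 100)
)

# Hypothetical helper: mean, median, Q1, Q3 and sd in one named vector.
five_stats <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    Q1     = quantile(x, 0.25, names = FALSE),
    Q3     = quantile(x, 0.75, names = FALSE),
    sd     = sd(x))
}

# aggregate() stores the vector result as a matrix column ...
res <- aggregate(age ~ sex + agegroup, data = df, FUN = five_stats)

# ... which do.call(data.frame, ...) expands into ordinary columns.
flat <- do.call(data.frame, res)
print(flat)
```

Flattening the result makes each statistic a plain column, which is easier to export or merge than the matrix column aggregate returns.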

# make some test data
age <- runif(100, 20, 60)
sex <- sample(c(0, 1), 100, replace = T)
agegroup <- sample(1:3, 100, replace = T)
test <- data.frame(age,sex,agegroup)

# define a new summary function to include the SD as well
# otherwise you will just get mean,median,min,max,Q1-Q3.
newsummary <- function(x) {c(summary(x),SD=sd(x))}

# get the summary stats by each agegroup/sex combo
by(test$age,test[c("sex","agegroup")],newsummary)

The results look like this; the output is in a list format.

> by(test$age,test[c("sex","agegroup")],newsummary)
sex: 0
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
22.07000 27.72000 38.36000 38.41000 48.02000 54.93000 11.50681 
------------------------------------------------------------ 
sex: 1
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
24.36000 38.20000 44.96000 44.55000 52.95000 58.03000 10.70105 
------------------------------------------------------------ 
sex: 0
agegroup: 2
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
21.52000 28.54000 36.75000 38.52000 49.45000 57.12000 12.26674 
------------------------------------------------------------ 
sex: 1
agegroup: 2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD 
20.0900 26.9900 31.7700 35.9800 44.6200 57.3500 11.9548 
------------------------------------------------------------ 
sex: 0
agegroup: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD 
20.5100 30.4300 39.6300 39.4100 47.4100 57.6000 11.9816 
------------------------------------------------------------ 
sex: 1
agegroup: 3
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD 
20.04000 25.01000 36.03000 37.58000 47.81000 59.65000 13.14822 
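
If a single table is easier to work with than the list-style printout, the same summary function can be applied with `split` and `sapply` instead of `by`. This is a sketch on simulated data arranged so that all six groups are non-empty; the row labels such as "0.1" (sex 0, agegroup 1) come from how `split` combines the two grouping columns.

```r
set.seed(1)  # reproducible simulated ages

# Simulated data; sex and agegroup cycle so all six combinations occur.
test <- data.frame(
  age      = runif(100, 20, 60),
  sex      = rep(c(0, 1), 50),
  agegroup = rep(1:3, length.out = 100)
)

# Same summary function as above: summary() plus the standard deviation.
newsummary <- function(x) {c(summary(x), SD = sd(x))}

# Split age by the sex/agegroup combination, summarise each piece,
# and transpose so that each group is one row.
groups <- split(test$age, test[c("sex", "agegroup")])
tab <- t(sapply(groups, newsummary))
print(tab)
```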
