Summary statistics of numerical and 2 factor variables (what would these commands in SAS be in R?)
I am new to R; I am used to SAS. I have a dataset with many variables, three of which are age, sex, and agegroup. I am trying to generate summary statistics (mean, median, Q1, Q3, SD) of the variable age within each combination of the sex and agegroup variables. That is, the summary statistics for age in females (sex=0) in agegroup 1, then agegroup 2, and so on, and the same for males (sex=1).
In SAS, I would use:
proc univariate data=mydata;
var age;
class agegroup;
class sex;
run;
What would this be in R?
Also, what is the R equivalent of SAS's npar1way? For example:
proc npar1way data=mydata;
where minutes ne 9;
var minutes;
class sex;
run;
Here minutes must not equal 9 because 9 codes missing values. How do I do this in R?
# In R, missing values are denoted by "NA" instead of the number 9.
# save this data in a text file
age agegroup sex
1 agegroup1 male
2 agegroup2 female
3 agegroup3 male
5 agegroup1 female
7 agegroup2 male
8 agegroup3 female
1 agegroup3 male
2 agegroup2 female
3 agegroup1 male
# Set the working directory to the location of the data file using the function
setwd("PATH OF THE DIRECTORY")
data <- read.table("data", header=TRUE, sep=" ")
data
data$sex <- factor(data$sex, levels = c('male', 'female'), ordered=TRUE)
data$agegroup <- factor(data$agegroup, levels = c('agegroup1', 'agegroup2', 'agegroup3'), ordered=TRUE)
# Know the structure of your data
str(data)
# Summary of the data
summary(data)
# Std. Dev. of the variable "age"
std.dev.age <- sd(data$age)
std.dev.age
# Summary of three variables in a table form
table(data)
# Plot a dodged bar chart with age ~ sex + agegroup
library("ggplot2")
ggplot(data = data, aes(x = sex, y = age, ymin = 0, ymax = 8, fill = agegroup)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.50) +
  scale_fill_manual(values = c("red", "green", "blue")) +
  labs(x = "", y = "age (years)", fill = " ")
You can use the aggregate function in R to split the data into subsets, compute summary statistics for each subset, and return the result in a convenient form.
# create some data
> age <- runif(100, 20, 60)
> sex <- sample(c(0, 1), 100, replace = TRUE)
> agegroup <- sample(1:3, 100, replace = TRUE)
You can then compute the quantiles for the subsets grouped by sex and agegroup:
> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="quantile")
  sex agegroup     x.0%    x.25%    x.50%    x.75%   x.100%
1   0        1 26.70523 31.75807 37.09244 46.49449 59.77582
2   1        1 20.68903 34.49182 45.66960 48.69480 54.90620
3   0        2 20.22123 33.22948 40.57074 47.32490 58.85273
4   1        2 23.50579 31.38165 35.69254 45.13376 50.68572
5   0        3 23.46469 29.72909 42.53047 46.93867 58.30279
6   1        3 20.64256 27.22600 39.70127 48.66251 59.61565
or compute the mean:
> aggregate(x=age, by=list(sex=sex, agegroup=agegroup), FUN="mean")
  sex agegroup        x
1   0        1 39.95470
2   1        1 41.53341
3   0        2 40.53606
4   1        2 37.32189
5   0        3 40.68784
6   1        3 38.74829
The same works for the standard deviation, the variance, or any other statistic you want to compute for each subset.
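As a minimal sketch of that idea, all the requested statistics can be collected in one call by passing aggregate a function that returns a named vector. The data frame df and the helper five_stats below are hypothetical names introduced for illustration, and the data are simulated rather than taken from the question.

```r
# Simulated data (an assumption; substitute your own data frame)
set.seed(1)
df <- data.frame(
  age      = runif(100, 20, 60),
  sex      = sample(c(0, 1), 100, replace = TRUE),
  agegroup = sample(1:3, 100, replace = TRUE)
)

# Hypothetical helper: mean, median, Q1, Q3 and SD in one named vector
five_stats <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    q1     = unname(quantile(x, 0.25)),
    q3     = unname(quantile(x, 0.75)),
    sd     = sd(x))
}

# One row per sex/agegroup combination; the statistics come back
# as a single matrix column named "age"
stats <- aggregate(age ~ sex + agegroup, data = df, FUN = five_stats)

# Flatten the matrix column into ordinary columns (age.mean, age.median, ...)
flat <- do.call(data.frame, stats)
flat
```

The flattening step is optional; it only makes the result easier to export or merge, since each statistic then sits in its own column.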
# make some test data
age <- runif(100, 20, 60)
sex <- sample(c(0, 1), 100, replace = TRUE)
agegroup <- sample(1:3, 100, replace = TRUE)
test <- data.frame(age,sex,agegroup)
# define a new summary function to include the SD as well
# otherwise you will just get mean,median,min,max,Q1-Q3.
newsummary <- function(x) {c(summary(x),SD=sd(x))}
# get the summary stats by each agegroup/sex combo
by(test$age,test[c("sex","agegroup")],newsummary)
The results look like this; the output is in a list format.
> by(test$age,test[c("sex","agegroup")],newsummary)
sex: 0
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD
22.07000 27.72000 38.36000 38.41000 48.02000 54.93000 11.50681
------------------------------------------------------------
sex: 1
agegroup: 1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD
24.36000 38.20000 44.96000 44.55000 52.95000 58.03000 10.70105
------------------------------------------------------------
sex: 0
agegroup: 2
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD
21.52000 28.54000 36.75000 38.52000 49.45000 57.12000 12.26674
------------------------------------------------------------
sex: 1
agegroup: 2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD
20.0900 26.9900 31.7700 35.9800 44.6200 57.3500 11.9548
------------------------------------------------------------
sex: 0
agegroup: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.      SD
20.5100 30.4300 39.6300 39.4100 47.4100 57.6000 11.9816
------------------------------------------------------------
sex: 1
agegroup: 3
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       SD
20.04000 25.01000 36.03000 37.58000 47.81000 59.65000 13.14822
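For the npar1way part of the question, a rough sketch follows. It assumes the goal is the two-group Wilcoxon rank-sum test (one of the tests npar1way reports for a two-level class variable), and it uses a simulated data frame mydata with columns minutes and sex, which is an assumption standing in for the real data. Instead of a where clause, the missing-value code 9 is recoded to NA, which the formula interface of wilcox.test drops by default.

```r
# Simulated stand-in for the real data (an assumption)
set.seed(42)
mydata <- data.frame(
  minutes = runif(40, 10, 60),
  sex     = rep(c(0, 1), each = 20)
)
mydata$minutes[c(3, 17, 28)] <- 9   # 9 is the missing-value code

# Recode the missing-value code to NA, R's own missing marker
mydata$minutes[mydata$minutes == 9] <- NA

# Wilcoxon rank-sum (Mann-Whitney) test of minutes between the two sexes;
# rows with NA in minutes are omitted by the default na.action
res <- wilcox.test(minutes ~ sex, data = mydata)
res

# Kruskal-Wallis test, which also covers more than two groups
kw <- kruskal.test(minutes ~ sex, data = mydata)
kw
```

If you would rather leave the data untouched, filtering with subset(mydata, minutes != 9) in the data argument mirrors the SAS where statement more literally.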