How can I write a short script that creates a new data frame reporting the following descriptive statistics for each column of continuous data in the survey below: mean, standard deviation, median, minimum value, maximum value, and sample size?
Distance Age Height Coning
1 21.4 18 3.3 Yes
2 13.9 17 3.4 Yes
3 23.9 16 2.9 Yes
4 8.7 18 3.6 No
5 241.8 6 0.7 No
6 44.5 17 1.3 Yes
7 30.0 15 2.5 Yes
8 32.3 16 1.8 Yes
9 31.4 17 5.0 No
10 32.8 13 1.6 No
11 53.3 12 2.0 No
12 54.3 6 0.9 No
13 96.3 11 2.6 No
14 133.6 4 0.6 No
15 32.1 15 2.3 No
16 57.9 12 2.4 Yes
17 30.8 17 1.8 No
18 59.9 7 0.8 No
19 42.7 15 2.0 Yes
20 20.6 18 1.7 Yes
21 62.0 8 1.3 No
22 53.1 7 1.6 No
23 28.9 16 2.2 Yes
24 177.4 5 1.1 No
25 24.8 14 1.5 Yes
26 75.3 14 2.3 Yes
27 51.6 7 1.4 No
28 36.1 9 1.1 No
29 116.1 6 1.1 No
30 28.1 16 2.5 Yes
31 8.7 19 2.2 Yes
32 105.1 6 0.8 No
33 46.0 15 3.0 Yes
34 102.6 7 1.2 No
35 15.8 15 2.2 No
36 60.0 7 1.3 No
37 96.4 13 2.6 No
38 24.2 14 1.7 No
39 14.5 15 2.4 No
40 36.6 14 1.5 No
41 65.7 5 0.6 No
42 116.3 7 1.6 No
43 113.6 8 1.0 No
44 16.7 15 4.3 Yes
45 66.0 7 1.0 No
46 60.7 7 1.0 No
47 90.6 7 0.7 No
48 91.3 7 1.3 No
49 14.4 18 3.1 Yes
50 72.8 14 3.0 Yes
You can write your own function to get such a summary into a data.frame:
# Defining the function
my.summary <- function(x, na.rm=TRUE){
  # N is the full vector length, so it also counts NA values even when na.rm=TRUE
  c(Mean=mean(x, na.rm=na.rm),
    SD=sd(x, na.rm=na.rm),
    Median=median(x, na.rm=na.rm),
    Min=min(x, na.rm=na.rm),
    Max=max(x, na.rm=na.rm),
    N=length(x))
}
# identifying numeric columns
ind <- sapply(df, is.numeric)
# applying the function to numeric columns only
sapply(df[, ind], my.summary)
        Distance       Age     Height
Mean    58.67200 11.840000  1.9160000
SD      45.48137  4.604168  0.9796626
Median  48.80000 13.500000  1.7000000
Min      8.70000  4.000000  0.6000000
Max    241.80000 19.000000  5.0000000
N       50.00000 50.000000 50.0000000
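The snippet above assumes the survey already sits in a data frame called df, and sapply returns a matrix rather than the data frame the question asks for. A minimal sketch of both steps follows. It uses only the first six survey rows to keep the example short, so its numbers will differ from the full-data output, and the names survey_text and stats are illustrative choices, not part of the original answer.

```r
# First six survey rows only (the full survey has 50 rows).
survey_text <- "
Distance Age Height Coning
1 21.4 18 3.3 Yes
2 13.9 17 3.4 Yes
3 23.9 16 2.9 Yes
4 8.7 18 3.6 No
5 241.8 6 0.7 No
6 44.5 17 1.3 Yes
"

# The header has one field fewer than the data rows, so read.table
# uses the first column (1, 2, 3, ...) as row names.
df <- read.table(text = survey_text, header = TRUE, stringsAsFactors = TRUE)

# Same summary function as in the answer.
my.summary <- function(x, na.rm = TRUE) {
  c(Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min = min(x, na.rm = na.rm),
    Max = max(x, na.rm = na.rm),
    N = length(x))
}

# Keep the numeric columns and convert the resulting matrix to a data frame.
ind <- sapply(df, is.numeric)
stats <- as.data.frame(sapply(df[, ind], my.summary))
stats
```

Wrapping the call in t() before as.data.frame() would instead give one row per variable and one column per statistic, if that layout is preferred.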
Or you can use the basicStats function from the fBasics package for a more detailed summary:
> library(fBasics)
> basicStats(df[, ind])
               Distance        Age    Height
nobs          50.000000  50.000000 50.000000
NAs            0.000000   0.000000  0.000000
Minimum        8.700000   4.000000  0.600000
Maximum      241.800000  19.000000  5.000000
1. Quartile   28.300000   7.000000  1.125000
3. Quartile   74.675000  15.750000  2.475000
Mean          58.672000  11.840000  1.916000
Median        48.800000  13.500000  1.700000
Sum         2933.600000 592.000000 95.800000
SE Mean        6.432037   0.651128  0.138545
LCL Mean      45.746337  10.531510  1.637583
UCL Mean      71.597663  13.148490  2.194417
Variance    2068.555118  21.198367  0.959739
Stdev         45.481371   4.604168  0.979663
Skewness       1.711028  -0.158853  0.905415
Kurtosis       3.753948  -1.574527  0.578684
The following use of do.call, rbind and sapply provides a summary for each column that has the class 'numeric'. You can write your own statistics function if you need different statistics than those of summary (see the answer of @Jilber).
mtcars$carb = as.factor(mtcars$carb) # Forcing one column to a factor
do.call('rbind', sapply(mtcars, function(x) if(is.numeric(x)) summary(x)))
       Min. 1st Qu.  Median     Mean 3rd Qu.    Max.
mpg  10.400  15.420  19.200  20.0900   22.80  33.900
cyl   4.000   4.000   6.000   6.1880    8.00   8.000
disp 71.100 120.800 196.300 230.7000  326.00 472.000
hp   52.000  96.500 123.000 146.7000  180.00 335.000
drat  2.760   3.080   3.695   3.5970    3.92   4.930
wt    1.513   2.581   3.325   3.2170    3.61   5.424
qsec 14.500  16.890  17.710  17.8500   18.90  22.900
vs    0.000   0.000   0.000   0.4375    1.00   1.000
am    0.000   0.000   0.000   0.4062    1.00   1.000
gear  3.000   3.000   4.000   3.6880    4.00   5.000
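To get the statistics asked for in the question instead of the summary() defaults, the same do.call/rbind pattern can be combined with a custom function. This is only a sketch: my.stats, cars and res are illustrative names, and Filter() is used here as one possible way to drop the non-numeric columns before applying the function.

```r
# Hypothetical helper returning the six statistics from the question.
my.stats <- function(x) {
  c(Mean = mean(x), SD = sd(x), Median = median(x),
    Min = min(x), Max = max(x), N = length(x))
}

cars <- mtcars
cars$carb <- as.factor(cars$carb)  # force one column to a factor, as above

# Keep the numeric columns, compute the statistics for each,
# and bind the named vectors into one row per variable.
res <- as.data.frame(do.call(rbind, lapply(Filter(is.numeric, cars), my.stats)))
res
```

Each row of res is one variable and each column one statistic; the factor column carb is left out because Filter() removes it before the statistics are computed.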
Here are some examples using data.table. I'm using the functions defined in the previous answers.
my.summary <- function(x, na.rm=TRUE){
  c(Mean=mean(x, na.rm=na.rm),
    SD=sd(x, na.rm=na.rm),
    Median=median(x, na.rm=na.rm),
    Min=min(x, na.rm=na.rm),
    Max=max(x, na.rm=na.rm),
    N=length(x))
}
# simulated data with the same column names as the survey
set.seed(123)
df <- data.frame(id = 1:1000,
                 Distance = rnorm(1000, 50, 100),
                 Age = rnorm(1000, 50, 100),
                 Height = rnorm(1000, 50, 100)
)
df$Coning <- as.factor(ifelse(df$Distance > 0, "Yes", "No"))
library(fBasics)
library(data.table)
DT <- data.table(df)
setkey(DT, id)
Group by the factor variable "Coning":
DT[,lapply(.SD,my.summary),by="Coning"]
Using my.summary() and basicStats() on just the numeric variables:
DT[, lapply(.SD, my.summary), .SDcols = names(DT)[2:4]]
BS <- DT[, sapply(.SD, basicStats), .SDcols = names(DT)[2:4]]
# label each row with the statistic names that basicStats() uses as row names
BS[, summary := rownames(basicStats(DT$Distance))]
setnames(BS, 1:3, names(DT)[2:4])
BS
DT[, lapply(.SD, summary), .SDcols = names(DT)[2:4]]
Using summary() on the numeric variables:
DT[, sapply(.SD, function(x) if(is.numeric(x)) summary(x)), .SDcols = names(DT)[2:4]]
Using summary() on the factor variable:
DT[, sapply(.SD, function(x) if(is.factor(x)) summary(x)), .SDcols = names(DT)[5]]
Using the quantile function is also quite useful:
DT[,sapply(.SD, function(x) if(is.numeric(x)) quantile(x)),, .SDcols = names(DT)[2:4]]
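One thing to watch with the grouped lapply(.SD, my.summary) call is that it returns six rows per group without saying which row is which statistic. A possible way around that, sketched here with the same simulated data, is to add the statistic names as an extra column in j. The names num_cols, stat_names and by_group are illustrative and not part of the original answer.

```r
library(data.table)

my.summary <- function(x, na.rm = TRUE) {
  c(Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min = min(x, na.rm = na.rm),
    Max = max(x, na.rm = na.rm),
    N = length(x))
}

# Simulated data, as in the answer above.
set.seed(123)
DT <- data.table(Distance = rnorm(1000, 50, 100),
                 Age = rnorm(1000, 50, 100),
                 Height = rnorm(1000, 50, 100))
DT[, Coning := factor(ifelse(Distance > 0, "Yes", "No"))]

num_cols <- c("Distance", "Age", "Height")
stat_names <- names(my.summary(DT$Distance))  # "Mean", "SD", ...

# One labelled row per statistic within each level of Coning.
by_group <- DT[, c(list(Statistic = stat_names), lapply(.SD, my.summary)),
               by = Coning, .SDcols = num_cols]
by_group
```

Restricting .SDcols to the three measurement columns also keeps the id column out of the summary.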
The collapse package provides a fast and efficient summary statistics generator, qsu. I've been looking for R functions similar to Stata's su (summarize), and this one works best for me.
https://sebkrantz.github.io/collapse/articles/collapse_intro.html
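A short sketch of what a qsu call can look like, using the built-in mtcars data rather than the survey. As far as I know, qsu reports N, Mean, SD, Min and Max by default but not the median, so it covers most, though not all, of the statistics in the question. The column selection and the object name res are illustrative.

```r
library(collapse)

# Quick summary of three numeric columns: one row per variable.
res <- qsu(mtcars[, c("mpg", "hp", "wt")])
res

# Grouped summary: the same statistics within each level of am,
# with the grouping variable given as a one-sided formula.
qsu(mtcars[, c("mpg", "hp", "wt", "am")], by = ~ am)
```

If the median is needed as well, it can be computed separately, for example with sapply(mtcars[, c("mpg", "hp", "wt")], median).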