Mean, standard deviation and 95% confidence interval for the mean of the following variables in R

Question

I need to create a summary table that shows the mean, standard deviation and 95% confidence interval for the mean of the following variables: selling price, number of bedrooms, size of house, and distance from city centre.

I have a file with data.

ID Price Bedrooms Size Pool Distance Suburbs Garage
1  1   300        2  124    0      8.6       1      0
2  2   340        2  142    0     10.3       1      0
3  3   280        2  145    0     17.5       4      1
4  4   340        2  139    0      7.9       1      0
5  5   310        2  155    0     10.9       4      1
6  6   320        2  134    0      5.8       3      1
library(data.table)   # needed for data.table syntax

mydata <- read.csv("Real_Estate.csv")
head(mydata)
dto <- as.data.table(mydata)
# column-wise mean and standard deviation
result_1 <- dto[, lapply(.SD, mean)]
result_2 <- dto[, lapply(.SD, sd)]

But I have no idea how to calculate the 95% CI and create the summary table.

Answer 1

Here's a reproducible tidyverse example that lets you create a summary table:

library(tidyverse)

df <- tibble(
  ID = 1:100,
  price = round(rnorm(100, mean = 500, sd = 50)),
  bedrooms = sample(1:4, 100, replace = TRUE)
)

df %>%
  pivot_longer(cols = c(price, bedrooms),
               names_to = "variable",
               values_to = "value") %>%
  group_by(variable) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            se = sd / sqrt(n()),
            CI_lower = mean - (1.96 * se),
            CI_upper = mean + (1.96 * se))
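
The 1.96 multiplier is the normal-approximation quantile, which is reasonable for the 100 simulated rows here. For a small sample, such as the six rows shown in the question, a t quantile gives a wider interval. A minimal base R sketch of the difference follows; the sample sizes are chosen only for illustration.

```r
# Compare the normal multiplier with t multipliers for a 95% interval.
z <- qnorm(0.975)                 # about 1.96, independent of sample size
n <- c(6, 30, 100)                # illustrative sample sizes (assumption)
t_mult <- qt(0.975, df = n - 1)   # t multiplier shrinks toward z as n grows

multipliers <- data.frame(n = n, z = z, t = t_mult)
print(multipliers)
```

In the pipeline above, replacing 1.96 with qt(0.975, df = n() - 1) would give the t-based version.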

Answer 2

There are two approaches. The first is to do it by hand: calculate the SD and the SE, set the degrees of freedom, and then compute the CI, as described at https://bookdown.org/logan_kelly/r_practice/p09.html
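
The by-hand route can be sketched in base R as follows. This is an illustration, not the code from the linked page: it uses iris$Sepal.Length as a stand-in for one of your columns and cross-checks the result against t.test().

```r
# Manual 95% CI for a mean, using the built-in iris data as a stand-in.
x <- iris$Sepal.Length

n     <- length(x)          # sample size
xbar  <- mean(x)            # sample mean
s     <- sd(x)              # sample standard deviation
se    <- s / sqrt(n)        # standard error of the mean
df    <- n - 1              # degrees of freedom
tcrit <- qt(0.975, df)      # two-sided 95% critical value

ci_manual <- c(lower = xbar - tcrit * se, upper = xbar + tcrit * se)
print(ci_manual)

# Cross-check against the interval reported by t.test()
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int
print(ci_ttest)
```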

Or you can directly use a package that does it, such as Rmisc, whose CI() function returns the confidence interval:

install.packages("Rmisc")
library(Rmisc)
mydata<-iris
CI(mydata$Sepal.Length, ci=0.95)

Finally, as a tip, you can use the psych package to get this kind of summary.

install.packages("psych")
library('psych')
describe(mydata)

It provides:

number of valid cases, mean, standard deviation, trimmed mean (with trim defaulting to 0.1), median, mad (median absolute deviation from the median), minimum, maximum, skew, kurtosis, and standard error.

Answer 3

A data.table solution is the following.

library(data.table)

ci <- function(x, conf = 0.95, na.rm = FALSE){
  xbar <- mean(x, na.rm = na.rm)
  # number of observations actually used
  n <- if (na.rm) sum(!is.na(x)) else length(x)
  # standard error of the mean, not the SD of the data
  se <- sd(x, na.rm = na.rm) / sqrt(n)
  p <- c((1 - conf)/2, 1 - (1 - conf)/2)
  # normal-approximation interval for the mean
  qq <- qnorm(p, mean = xbar, sd = se)
  setNames(qq, c("lower", "upper"))
}
stats <- function(x, na.rm = FALSE){
  CI <- ci(x, na.rm = na.rm)
  c(
    Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Lower = CI[1],
    Upper = CI[2]
  )
}

df1 <- as.data.table(df1)

# rows of the result are Mean, SD, Lower, Upper
df1[, lapply(.SD, stats), .SDcols = c("Price", "Size", "Distance")]
#       Price      Size  Distance
#1: 315.00000 139.83333 10.166667
#2:  23.45208  10.45785  4.024757
#3: 296.23477 131.46546  6.946250
#4: 333.76523 148.20120 13.387084
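
The table above has the statistics in rows and the variables in columns. A variables-in-rows layout, closer to a conventional summary table, can be sketched in base R. This sketch re-enters the six rows from the question so that it runs on its own, includes Bedrooms, and uses a t quantile instead of qnorm because the sample is small; the names houses and mean_ci are hypothetical and not part of any package.

```r
# Data from the question, re-entered so the sketch is self-contained.
houses <- data.frame(
  Price    = c(300, 340, 280, 340, 310, 320),
  Bedrooms = c(2, 2, 2, 2, 2, 2),
  Size     = c(124, 142, 145, 139, 155, 134),
  Distance = c(8.6, 10.3, 17.5, 7.9, 10.9, 5.8)
)

# Hypothetical helper: mean, SD and t-based CI for one numeric vector.
mean_ci <- function(x, conf = 0.95) {
  x     <- x[!is.na(x)]                         # drop missing values
  n     <- length(x)
  se    <- sd(x) / sqrt(n)                      # standard error of the mean
  tcrit <- qt(1 - (1 - conf) / 2, df = n - 1)   # two-sided critical value
  c(Mean  = mean(x),
    SD    = sd(x),
    Lower = mean(x) - tcrit * se,
    Upper = mean(x) + tcrit * se)
}

# sapply() returns statistics in rows; t() puts one variable per row.
summary_table <- t(sapply(houses, mean_ci))
print(round(summary_table, 2))
```

Bedrooms is constant in these six rows, so its SD is zero and its interval collapses to a single point.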

Data

df1 <- read.table(text = "
ID Price Bedrooms Size Pool Distance Suburbs Garage
1  1   300        2  124    0      8.6       1      0
2  2   340        2  142    0     10.3       1      0
3  3   280        2  145    0     17.5       4      1
4  4   340        2  139    0      7.9       1      0
5  5   310        2  155    0     10.9       4      1
6  6   320        2  134    0      5.8       3      1
", header = TRUE)

Answer 4

You can also use skimr by creating functions for the lower and upper CI bounds and then dropping any statistics you don't want by setting them to NULL.

library(skimr)
lower <- function(x ){Rmisc::CI(x)["lower"]}
upper <- function(x ){Rmisc::CI(x)["upper"]}
myskim <- skim_with(numeric = sfl(mean = mean, sd = sd, lower =  lower, 
                                  upper = upper), base = NULL,
                                  append =  FALSE)
myskim(mtcars)

4 answers

Answer 1
1 2021-06-05 18:01:26

Answer 2
0 2021-06-05 17:40:45

Answer 3
0 2021-06-05 18:09:05

Answer 4
0 2021-06-05 18:54:20
