Determine the percentage of rows with missing values in an R data frame

Question

I have a data frame with three variables, one of which contains some missing values, as follows:

subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)

df1 <- data.frame(subject,part,sad)

Using a loop, I created a new data frame holding the mean of "sad" for each subject and part, like this:

columns<-c("sad.m",
           "part", 
           "subject")

df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns

tn<-unique(df1$subject)

row=1

for (s in tn){
  for (i in 0:3){
    TN<-df1[df1$subject==s&df1$part==i,]
    df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
    df2[row,"part"]<-i 
    df2[row,"subject"]<-s 
    row=row+1
  }
  
}

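Now I want to include an additional variable, "missing", that gives the percentage of rows with missing values for each subject and part, so that I end up with df3: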

subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)

df3 <- data.frame(subject,part,sad.m,missing)

I would really appreciate any help with solving this!

Answer 1

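It is best to avoid loops in R where possible, since they get messy and tend to be slow. The dplyr library is perfect for this kind of thing and well worth learning. It can save you a lot of time.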

You can create a data frame containing both variables by first grouping by subject and part and then summarising the grouped data frame:

library(dplyr)

df2 = df1 %>% 
    dplyr::group_by(subject, part) %>%
    dplyr::summarise(
        sad_mean = mean(na.omit(sad)),
        # percentage of NA values within each group
        na_count = sum(is.na(sad)) / n() * 100
    )

df2
# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part sad_mean na_count
    <dbl> <dbl>    <dbl>    <dbl>
1       1     0     4.75        0
2       1     1     2          50
3       1     2     2.5        50
4       1     3     1.67       25
5       2     0     5.5        50
6       2     1     4.5        50
7       2     2     4          50
8       2     3     4          25
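For comparison, the same loop-free summary can be sketched in base R with aggregate(). This is only a sketch, not part of the original answer: the names means and pct are made up for illustration, and rep() is used as shorthand to rebuild the subject and part columns from the question.

```r
# Rebuild the example data from the question.
subject <- rep(1:2, each = 16)
part <- rep(rep(0:3, each = 4), times = 2)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject, part, sad)

# na.action = na.pass keeps the NA rows so the functions below can see them;
# by default the formula interface drops them before aggregating.
means <- aggregate(sad ~ subject + part, data = df1,
                   FUN = function(x) mean(x, na.rm = TRUE), na.action = na.pass)
pct <- aggregate(sad ~ subject + part, data = df1,
                 FUN = function(x) mean(is.na(x)) * 100, na.action = na.pass)

# Both calls use the same grouping, so their rows line up one to one.
df3 <- data.frame(subject = means$subject, part = means$part,
                  sad.m = means$sad, missing = pct$sad)

# Sort by subject, then part, to match the layout requested in the question.
df3 <- df3[order(df3$subject, df3$part), ]
rownames(df3) <- NULL
df3
```

The column names sad.m and missing follow the df3 layout described in the question.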

Answer 2

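For each subject and part, you can compute the mean of sad and, by combining is.na with mean, the percentage of NA values. Because is.na returns a logical vector, its mean is the proportion of TRUE values, that is, the share of missing entries.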

library(dplyr)
df1 %>%
  group_by(subject, part) %>%
  summarise(sad.m = mean(sad, na.rm = TRUE), 
            perc_missing = mean(is.na(sad)) * 100)

#  subject  part sad.m perc_missing
#    <dbl> <dbl> <dbl>        <dbl>
#1       1     0  4.75            0
#2       1     1  2              50
#3       1     2  2.5            50
#4       1     3  1.67           25
#5       2     0  5.5            50
#6       2     1  4.5            50
#7       2     2  4              50
#8       2     3  4              25

The same logic with data.table:

library(data.table)

setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE), 
               perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
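To see why mean(is.na(...)) yields a percentage, it may help to run the idiom on a single group. The small sketch below uses the four sad values for subject 1, part 2 from the question; the variable names x and pct_missing are only illustrative.

```r
x <- c(NA, 2, 3, NA)                 # "sad" values for subject 1, part 2
is.na(x)                             # TRUE FALSE FALSE TRUE
pct_missing <- mean(is.na(x)) * 100  # TRUE/FALSE are averaged as 1/0, giving 50
pct_missing
```

Two of the four values are missing, so the result is 50, which matches the corresponding row of the output above.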

Answer 3

Try this dplyr approach to compute df3:

library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))

Output:

# A tibble: 8 x 3
# Groups:   subject [2]
  subject  part     N
    <dbl> <dbl> <dbl>
1       1     0     0
2       1     1    50
3       1     2    50
4       1     3    25
5       2     0    50
6       2     1    50
7       2     2    50
8       2     3    25

To combine this with df2 (the data frame of means built in the question), you can use left_join():

#Left join
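df3 <- df1 %>% group_by(subject,part) %>%
  summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
  # name the join keys explicitly instead of relying on the shared columns
  left_join(df2, by = c("subject", "part"))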

Output:

# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part     N sad.m
    <dbl> <dbl> <dbl> <dbl>
1       1     0     0  4.75
2       1     1    50  2   
3       1     2    50  2.5 
4       1     3    25  1.67
5       2     0    50  5.5 
6       2     1    50  4.5 
7       2     2    50  4   
8       2     3    25  4
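If dplyr is not available, base R's merge() with all.x = TRUE performs a comparable left join. The sketch below is an assumption-laden illustration rather than the answerer's method: pct and means are hypothetical stand-ins holding three of the rows shown above.

```r
# Hypothetical stand-ins for the two per-group summaries (values taken from the output above).
pct <- data.frame(subject = c(1, 1, 2), part = c(0, 1, 0), N = c(0, 50, 50))
means <- data.frame(subject = c(1, 1, 2), part = c(0, 1, 0), sad.m = c(4.75, 2, 5.5))

# all.x = TRUE keeps every row of pct, even if means has no matching group.
joined <- merge(pct, means, by = c("subject", "part"), all.x = TRUE)
joined
```

Naming the keys in by makes it explicit that rows are matched on subject and part.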
