如何替換數據集中的缺失點？

Question

我想在R中編寫一個函數來接收任何數據集作為輸入，這樣數據集就有一些缺失點（ NA ）。 現在我想使用mean函數來替換數據集中缺失點（ NA ）的一些數字/值。 我在想的是這樣的功能：

x<function(data,type=c("mean", lag=2))

實際上，它應該計算缺失點之前的兩個數字和之前兩個數字的平均值（因為我認為函數中的lag為2 ）。 例如，如果缺失點位於第 12 位，則該函數應計算第 10、11、13 和 14 位數字的平均值，並將結果替換為第 12 位缺失點。 在特定情況下，例如，如果缺失點在最后一個位置，並且我們后面沒有兩個數字，則該函數應計算相應列的所有數據的平均值並替換缺失點。 這里我舉個例子說明一下。 考慮以下數據集：

3  7 8 0  8  12 2
5  8 9 2  8  9  1
1  2 4 5  0  6  7
5  6 0 NA 3  9  10
7  2 3 6  11 14 2
4  8 7 4  5  3  NA

在上面的數據集中，第一個NA應該替換為數字2 、 5 （前兩個數據）和6和4 （后兩個數據）的均值，即(2+5+6+4)/4等於到17/4 。 最后一個NA應該替換為最后一列的平均值，即(2+1+7+10+2)/5等於22/5 。

我的問題是如何在上面的函數中添加一些代碼（ if 、 if-else或其他循環）來制作一個完整的函數來滿足上面的解釋。 我應該強調我想使用apply函數系列。

Answer 1

首先，我們可以定義一個平滑單個向量的函數：

library(dplyr)

smooth = function(vec, n=2){
    # Lead and lag the vector twice in both directions
    purrr::map(1:n, function(i){
        cbind(
            lead(vec, i),
            lag(vec, i)
        )
    }) %>%
        # Bind the matrix together
        do.call(cbind, .) %>%
        # Take the mean of each row, ie the smoothed version at each position
        # If there are NAs in the mean, it will itself be NA
        rowMeans() %>%
        # In order, take a) original values b) locally smoothed values
        # c) globally smoothed values (ie the entire mean ignoring NAs)
        coalesce(vec, ., mean(vec, na.rm=TRUE))
}

> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1]  2.0  1.0  7.0 10.0  2.0  4.4

然后我們可以將它應用到每一列：

> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>% 
    matrix(byrow=TRUE, ncol=7) %>%
    as_tibble(.name_repair="universal") %>%                        
    mutate(across(everything(), smooth))
# A tibble: 6 × 7
   ...1  ...2  ...3  ...4  ...5  ...6  ...7
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     7     8  0        8    12   2  
2     5     8     9  2        8     9   1  
3     1     2     4  5        0     6   7  
4     5     6     0  4.25     3     9  10  
5     7     2     3  6       11    14   2  
6     4     8     7  4        5     3   4.4

Answer 2

請使用data.table庫找到以下一種解決方案。

正品

您的數據：

m1 <- "3  7 8 0  8  12 2
       5  8 9 2  8  9  1
       1  2 4 5  0  6  7
       5  6 0 NA 3  9  10
       7  2 3 6  11 14 2
       4  8 7 4  5  3  NA"

myData<- read.table(text=m1,h=F)

函數replaceNA代碼

library(data.table)

replaceNA <- function(data){
  
  setDT(data)
  
  # Create a data.table identifying rows and cols indexes of NA values in the data.table
  NA_DT <- as.data.table(which(is.na(data), arr.ind=TRUE))
  
  # Select row and column indexes of NAs that are not at the last row in the data.table
  NA_not_Last <- NA_DT[row < nrow(data)]
  
  # Select row and column indexes of NA that is at the last row in the data.table
  NA_Last <- NA_DT[row == nrow(data)]
  
  # Create a vector of column names where NA values are not at the last row in the data.table
  Cols_NA_not_Last <- colnames(data)[NA_not_Last[,col]]
  
  # Create a vector of column names where NA values are at the last row in the data.table
  Cols_NA_Last <- colnames(data)[NA_Last[,col]]
  
  # Replace NA values that are not at the last row in the data.table by the mean of the values located 
  # in the two previous lines and the two following lines of the line containing the NA value
  data[, (Cols_NA_not_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(c(x[which(is.na(x))-2], x[which(is.na(x))-1], x[which(is.na(x))+1], x[which(is.na(x))+2]), na.rm = TRUE))), .SDcols = Cols_NA_not_Last][]
  
  # Replace NA values that are at the last row in the data.table by the mean of all the values in the column where the NA value is found 
  data[, (Cols_NA_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(x, na.rm = TRUE))), .SDcols = Cols_NA_Last][]

  return(data)
}

使用您的數據測試功能

replaceNA(myData)
#>    V1 V2 V3   V4 V5 V6   V7
#> 1:  3  7  8 0.00  8 12  2.0
#> 2:  5  8  9 2.00  8  9  1.0
#> 3:  1  2  4 5.00  0  6  7.0
#> 4:  5  6  0 4.25  3  9 10.0
#> 5:  7  2  3 6.00 11 14  2.0
#> 6:  4  8  7 4.00  5  3  4.4

^{由reprex 包(v2.0.1) 於 2021 年 11 月 8 日創建}

如何替換數據集中的缺失點？

問題描述

2 個解決方案

解決方案1
0 2021-11-07 14:43:08

解決方案2
0 2021-11-07 23:46:54

如何替換數據集中的缺失點？

問題描述

2 個解決方案

解決方案1 0 2021-11-07 14:43:08

解決方案2 0 2021-11-07 23:46:54

解決方案1
0 2021-11-07 14:43:08

解決方案2
0 2021-11-07 23:46:54