[英]How can I make this R matrix filling function faster?
前段時間我寫了一個函數來填充時間序列矩陣,它根據所需的規范得到了NA值,並且它在幾個大約50000行,350列的矩陣上有其他用途。 矩陣可以包含數字或字符值。 主要的問題是修復矩陣很慢,我想我會評估一些專家如何更快地完成這項工作。
我想去rcpp或者並行它可能有所幫助,但我認為這可能是我的設計而不是R本身效率低下。 我通常在R中對所有內容進行處理,但由於缺失的值沒有遵循任何模式,我發現除了基於每行的矩陣之外沒有別的辦法。
需要調用該函數,以便它可以攜帶前向缺失值,並且還可以調用以快速填充最后一個已知值的最新值。
這是一個示例矩陣:
testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA,
4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA,
NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA,
66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA,
-12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA,
NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39,
NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA,
4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L,
9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e",
"ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31",
"2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31",
"2011-04-30", "2011-05-31")))
print(testMatrix)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
這是我目前使用的功能:
# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax --- The max number of columns to carry a number
# forward if there are no more values in the
# time series row.
#
# Returns:
# A matrix with gaps filled.
GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {
if("DEBUG_ON" %in% ls(globalenv())){browser()}
cntRow <- nrow(inputMatrix)
cntCol <- ncol(inputMatrix)
#
if (forwardLooking) {
for (i in 1:cntRow) {
# Store the location of the first non NA element in the row
firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
if (!(is.na(firstValueCol))) {
if (!(firstValueCol == cntCol)) {
nextValueCol <- firstValueCol
# If there is a a value number in the row and it's not at the end of the time
# series, start iterating through the row while there are more NA values and
# more data values and not at the end of the row continue.
while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
# Find the next NA element
nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
# Find the next value element
nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
# If there is another value element then fill up all NA elements in between with the last known value
if (!is.na(nextValueCol)) {
inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
} else {
# If there is no other value element then fill up all NA elements up to the max number supplied
# with the last known value unless it's close to the end of the row then just fill up to the end.
inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
nextValueCol <- cntCol
}
}
}
}
}
} else {
for (i in 1:cntRow) {
if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
if (length(tempRow[!is.na(tempRow)])>0) {
lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
}
}
}
}
return(inputMatrix)
}
然后我用以下的方式調用它:
> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
要么
> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 4.39
這個例子運行得很快,但有沒有辦法讓它對大型矩陣更快?
> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
86.47 0.06 87.24
這個虛擬的有很多NA行和完全填充的,但正常的可能需要大約15-20分鍾。
UPDATE
關於Charles關於na.locf的評論沒有完全反映上述邏輯:下面是最終函數如何排除輸入檢查等的簡化版本:
FillGaps <- function( dataMatrix, fillGapMax ) {
require("zoo")
numRow <- nrow(dataMatrix)
numCol <- ncol(dataMatrix)
iteration <- (numCol-fillGapMax)
if(length(iteration)>0) {
for (i in iteration:1) {
tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
}
}
return(dataMatrix)
}
我可能錯了,但我認為這是在zoo包中實現的:使用na.locf
函數。
使用給定的示例矩陣,首先我們應該轉置它,並在調用na
函數后,我們'重新轉換'結果矩陣。 例如:
> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98 29.98
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 4.03 4.76 4.76 4.76 4.39 4.39 4.39 4.39
並且有小的maxgap
:
> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0))
2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a NA NA NA NA NA NA NA NA NA
ID_b NA NA NA NA NA NA NA NA NA
ID_c NA NA NA NA NA NA NA NA 4.72
ID_d 29.98 29.98 29.98 NA 29.98 29.98 29.98 29.98 NA
ID_e 66.89 66.89 66.89 66.89 66.89 66.89 66.89 50.65 50.65
ID_f NA NA NA NA NA NA NA NA NA
ID_g -12.78 -12.78 -12.78 -12.78 -12.78 -12.78 -10.72 -10.72 -10.72
ID_h -11.65 -11.65 NA NA NA NA -11.65 -11.65 -38.61
ID_i NA NA NA NA NA NA NA NA 45.30
ID_j 4.03 NA 4.76 4.76 4.76 4.39 4.39 4.39 NA
使用na.locf
獲得的性能可以看出:
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
user system elapsed
79.238 0.540 80.398
> system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12)))
user system elapsed
17.129 0.267 17.513
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.