"Smoothing" time data - can it be done more efficiently?

Question

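I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, START, END (in that order).

Now I want all rows with the same ID that have overlapping time spans (or whose start date is exactly the day after another row's end date) to be merged together.

Merging them means that they end up in a single row with the same ID, min(start date) and max(end date) (I hope you understand what I mean).

I have written a function for this (it is not fully tested, but it looks fine for the moment). The problem is that, since my data frame has nearly 100,000 observations, the function is very slow.

Can you help me make it more efficient?

Here is the function: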
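To make the merge rule concrete, here is a minimal sketch of the adjacency test on a few dates. The dates and variable names are chosen purely for illustration. Adding 1 to a Date moves it forward by one day, so a row is merged whenever its start is no later than the day after the current end.

```r
# Illustrative dates only; the variable names are hypothetical
curEnd    <- as.Date("2000-03-31")
adjacent  <- as.Date("2000-04-01")  # starts the day after curEnd
afterAGap <- as.Date("2000-04-02")  # leaves a one-day gap

# TRUE means "start a new episode", FALSE means "merge into the current one"
startsNewAdjacent <- (curEnd + 1) < adjacent
startsNewGap      <- (curEnd + 1) < afterAGap

print(startsNewAdjacent)  # FALSE: adjacent rows are merged
print(startsNewGap)       # TRUE: a gap starts a new episode
```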

smoothingEpisodes <- function (theData) {
    theOutput <- data.frame()

    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

    theOutput
}

Thanks!

[EDIT]

Test data:

    ID      START        END
1    1 2000-01-01 2000-03-31
2    1 2000-04-01 2000-05-31
3    1 2000-04-15 2000-07-31
4    1 2000-09-01 2000-10-31
5    2 2000-01-15 2000-03-31
6    2 2000-02-01 2000-03-15
7    2 2000-04-01 2000-04-15
8    3 2000-06-01 2000-06-15
9    3 2000-07-01 2000-07-15

(START and END are of type "Date"; ID is numeric.)

The dput() of the data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

Answer 1

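The first optimisation I would suggest [without really thinking too hard about what you are trying to do] is to allocate storage for theOutput. At the moment you are growing theOutput at each iteration of the loop. In R that is an absolute no-no! It is something you never do, unless you like woefully slow code. R has to copy the object and expand it during every iteration, and that is slow.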

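Looking at the code, we know that theOutput needs at most nrow(theData) rows (one per input row when nothing is merged) and 3 columns. So create it before the loop starts:

theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData)))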

Then fill in that object inside the loop:

theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)

for example.

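It isn't clear what START and END are. If they are numeric, then working with a matrix rather than a data frame could also improve speed.

Also, creating a data frame at each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could fill in the bits you want directly, without incurring a data.frame() call during each iteration: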

theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
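The difference between growing and pre-allocating can be seen in a small, self-contained sketch. The functions grow() and prealloc() and the column VALUE are made up for illustration, and the pre-allocated variant uses plain vectors rather than a data frame; actual timings will depend on your machine.

```r
n <- 2000L

# Growing: rbind() copies the whole data frame on every iteration
grow <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(ID = i, VALUE = i * 2))
    }
    out
}

# Pre-allocating: create full-length vectors once, fill them by index,
# and build the data frame a single time at the end
prealloc <- function(n) {
    id <- integer(n)
    value <- numeric(n)
    for (i in seq_len(n)) {
        id[i] <- i
        value[i] <- i * 2
    }
    data.frame(ID = id, VALUE = value)
}

tGrow <- system.time(resGrow <- grow(n))
tPre  <- system.time(resPre <- prealloc(n))
print(tGrow)
print(tPre)
```

Both functions return the same columns; only the way the storage is obtained differs.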

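The best tip I can give you, however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, large enough to give you a bit of run time to gather useful profiling data, but small enough that you don't have to wait ages for the profiling run to complete. To profile in R, use Rprof():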

Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)

You can look at the output using

summaryRprof("my_fun_profile.Rprof")
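Put together, a complete profiling run might look like the sketch below. slowFun() is a hypothetical stand-in for the function being profiled, and the profile is written to a temporary file rather than to the file name used above.

```r
# Hypothetical slow function, used only to give the profiler something to measure
slowFun <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        out <- rbind(out, data.frame(ID = i, VALUE = sqrt(i)))
    }
    out
}

profFile <- tempfile(fileext = ".Rprof")
Rprof(filename = profFile)   # start sampling the call stack
res <- slowFun(3000L)
Rprof(NULL)                  # stop sampling

prof <- summaryRprof(profFile)
print(head(prof$by.self))    # functions ranked by time spent in their own code
```

The by.self table is usually the first place to look: it shows which functions consume time themselves, as opposed to time spent in the functions they call.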

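Hadley Wickham (@hadley) has a package that makes this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.

Edit: As the OP provided some test data, I knocked up something quick to show the speed-up achieved just by following good loop practice: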

smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
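    ## drop the unused pre-allocated rows (ID still 0); logical indexing also
    ## works when there are none, whereas -which(...) would then drop every row
    theOutput <- theOutput[theOutput$ID != 0, ]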
    theOutput
}

Using the test data set provided, stored in the object testData, I get:

> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed 
  1.091   0.000   1.131 
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed 
  0.506   0.004   0.517

That is roughly a 50% reduction in run time. Not dramatic, but simple to achieve just by not growing an object at each iteration.

Answer 2

I did it slightly differently, to avoid having to delete empty rows at the end:

smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    theLength <- nrow(theData)

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd

            j <- j + 1

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

A pretty big improvement over my original version!

Answer 3

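Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns into integer and double vectors.

As a side note, I try to use [[ where applicable, and I try to keep integers as integers by writing j <- j + 1L and so on. That makes no difference here, but sometimes coercing between integers and doubles can take quite some time.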

smoothingEpisodes3 <- function (theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    id <- as.integer(theData[["ID"]])
    start <- as.numeric(theData[["START"]])
    end <- as.numeric(theData[["END"]])

    curId <- id[[1L]]
    curStart <- start[[1L]]
    curEnd <- end[[1L]]

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- id[[i]]
        nextStart <- start[[i]]
        nextEnd <- end[[i]]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[[j]] <- curId
            out.2[[j]] <- curStart
            out.3[[j]] <- curEnd

            j <- j + 1L

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[[j]] <- curId
    out.2[[j]] <- curStart
    out.3[[j]] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

The following code then shows the speed difference. I just took your data and replicated it 1000 times...

x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))

system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
all.equal( a1, a2 )
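The effect of extracting the columns first can be isolated in a small sketch. The data frame and the two summing functions below are made up for illustration; the point is only that indexing a data frame element by element is far more expensive than indexing a plain vector, and actual timings will vary.

```r
set.seed(1)
n <- 20000L
df <- data.frame(ID = seq_len(n), VALUE = runif(n))

# Element access through the data frame: every df[i, "VALUE"] goes
# through the data-frame subsetting method
sumViaDataFrame <- function(df) {
    total <- 0
    for (i in seq_len(nrow(df))) {
        total <- total + df[i, "VALUE"]
    }
    total
}

# Extract the column once, then index the plain vector inside the loop
sumViaVector <- function(df) {
    value <- df[["VALUE"]]
    total <- 0
    for (i in seq_along(value)) {
        total <- total + value[[i]]
    }
    total
}

tDf  <- system.time(s1 <- sumViaDataFrame(df))
tVec <- system.time(s2 <- sumViaVector(df))
print(tDf)
print(tVec)
```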

Answer 1
2 (accepted) 2011-06-21 13:31:12

Answer 2
1 2011-06-21 14:54:04

Answer 3
1 2011-06-21 23:30:15
