Collapsing a data.frame into a data.frame: problems with by() and aggregate()

Question

Suppose I have the following data and a function that returns the summary statistics I want:

landlines <- data.frame(
                year=rep(c(1990,1995,2000,2005,2010),times=3),
                country=rep(c("US", "Brazil", "Asia"), each=5),
                pct =  c(0.99, 0.99, 0.98, 0.05, 0.9,
                         0.4,  0.5,  0.55, 0.5,  0.45,
                         0.7,  0.85, 0.9,  0.85, 0.75)
                )
someStats <- function(x)
{
  dp <- as.matrix(x$pct)-mean(x$pct)
  indp <- as.matrix(x$year)-mean(x$year)
  f <- lm.fit( indp,dp )$coefficients
  w <- sd(x$pct)
  m <- min(x$pct)
  results <- c(f,w,m)
  names(results) <- c("coef","sdev", "minPct")
  results
}

I can successfully apply the function to a subset of the data like this:

> someStats(landlines[landlines$country=="US",])
      coef      sdev    minPct 
 -0.022400  0.410938  0.050000

or view a breakdown by country like this:

> by(landlines, list(country=landlines$country), someStats)
country: Asia
      coef       sdev     minPct 
0.00200000 0.08215838 0.70000000 
--------------------------------------------------------------------------------------- 
country: Brazil
      coef       sdev     minPct 
0.00200000 0.05700877 0.40000000 
--------------------------------------------------------------------------------------- 
country: US
      coef      sdev    minPct 
 -0.022400  0.410938  0.050000

The trouble is that this is not a data.frame object that I can process further, and it cannot be converted like this:

> as.data.frame( by(landlines, list(country=landlines$country), someStats) )
Error in as.data.frame.default(by(landlines, list(country = landlines$country),  : 
  cannot coerce class '"by"' into a data.frame

"No problem," I thought, since the similar aggregate() function does return a data.frame:

> aggregate(landlines$pct, by=list(country=landlines$country), min)
  country    x
1    Asia 0.70
2  Brazil 0.40
3      US 0.05

The problem is that it does not work with arbitrary functions:

> aggregate(landlines, by=list(country=landlines$country), someStats)
Error in x$pct : $ operator is invalid for atomic vectors
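
The error arises because aggregate() on a data frame splits each column by group and hands FUN one column vector at a time, never a sub-data-frame, so there is nothing for x$pct to index. A minimal sketch of this, reusing the landlines data from the question (the name pieces is only illustrative):

```r
landlines <- data.frame(
  year    = rep(c(1990, 1995, 2000, 2005, 2010), times = 3),
  country = rep(c("US", "Brazil", "Asia"), each = 5),
  pct     = c(0.99, 0.99, 0.98, 0.05, 0.9,
              0.4,  0.5,  0.55, 0.5,  0.45,
              0.7,  0.85, 0.9,  0.85, 0.75)
)

# Report the class of whatever aggregate() passes to FUN.
# Each call receives a plain numeric vector (one column of one group).
pieces <- aggregate(landlines[c("year", "pct")],
                    by  = list(country = landlines$country),
                    FUN = function(x) class(x)[1])
print(pieces)
```

Both the year and pct columns of the result should read "numeric", which is why a function that expects a whole data frame cannot be used here.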

What I really want is a data.frame object with the following columns:

country
coef
sdev
minPct

How can I do that?

Answer 1

Take a look at the plyr package, in particular ddply:

> library(plyr)
> ddply(landlines, .(country), someStats)
  country    coef       sdev minPct
1    Asia  0.0020 0.08215838   0.70
2  Brazil  0.0020 0.05700877   0.40
3      US -0.0224 0.41093795   0.05

Ideally your function would explicitly return a data.frame, but in this case the result can be coerced to one easily and correctly.
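
As a sketch of what "explicitly return a data.frame" could look like, here is a hypothetical variant of the question's function, named someStatsDF for illustration only, that returns a one-row data.frame. The ddply call is guarded because plyr may not be installed:

```r
landlines <- data.frame(
  year    = rep(c(1990, 1995, 2000, 2005, 2010), times = 3),
  country = rep(c("US", "Brazil", "Asia"), each = 5),
  pct     = c(0.99, 0.99, 0.98, 0.05, 0.9,
              0.4,  0.5,  0.55, 0.5,  0.45,
              0.7,  0.85, 0.9,  0.85, 0.75)
)

# Hypothetical rewrite of someStats(): same statistics, one-row data.frame.
someStatsDF <- function(x) {
  dp   <- as.matrix(x$pct)  - mean(x$pct)    # centred response
  indp <- as.matrix(x$year) - mean(x$year)   # centred predictor
  data.frame(coef   = as.numeric(lm.fit(indp, dp)$coefficients),
             sdev   = sd(x$pct),
             minPct = min(x$pct))
}

# Only run the plyr call when the package is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  byCountry <- plyr::ddply(landlines, "country", someStatsDF)
  print(byCountry)
}
```

Returning a data.frame keeps the column names and types under the function's control instead of relying on coercion.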

Answer 2

by objects are actually lists, so you can use rbind within do.call:

> do.call("rbind", by(landlines, list(country=landlines$country), someStats))
          coef       sdev minPct
Asia    0.0020 0.08215838   0.70
Brazil  0.0020 0.05700877   0.40
US     -0.0224 0.41093795   0.05
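
Note that the rbind result is a matrix with the countries as row names, not yet a data.frame. One possible way to finish the job in base R, sketched with the illustrative names m and stats:

```r
landlines <- data.frame(
  year    = rep(c(1990, 1995, 2000, 2005, 2010), times = 3),
  country = rep(c("US", "Brazil", "Asia"), each = 5),
  pct     = c(0.99, 0.99, 0.98, 0.05, 0.9,
              0.4,  0.5,  0.55, 0.5,  0.45,
              0.7,  0.85, 0.9,  0.85, 0.75)
)

# someStats() as defined in the question.
someStats <- function(x) {
  dp   <- as.matrix(x$pct)  - mean(x$pct)
  indp <- as.matrix(x$year) - mean(x$year)
  f <- lm.fit(indp, dp)$coefficients
  results <- c(f, sd(x$pct), min(x$pct))
  names(results) <- c("coef", "sdev", "minPct")
  results
}

# Stack the per-country vectors into a matrix (row names = countries).
m <- do.call("rbind", by(landlines, list(country = landlines$country), someStats))

# Move the row names into a proper column and reset the row names.
stats <- data.frame(country = rownames(m), m)
rownames(stats) <- NULL
print(stats)
```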

Answer 3

aggregate is designed for a different purpose. What you want is lapply(split()):

> lapply( split(landlines, list(country=landlines$country)), FUN=someStats)
$Asia
      coef       sdev     minPct 
0.00200000 0.08215838 0.70000000 

$Brazil
      coef       sdev     minPct 
0.00200000 0.05700877 0.40000000 

$US
     coef      sdev    minPct 
-0.022400  0.410938  0.050000

If the output will be predictable, regularly shaped values, it is better to use sapply:

> sapply( split(landlines, list(country=landlines$country)), FUN=someStats)
             Asia     Brazil        US
coef   0.00200000 0.00200000 -0.022400
sdev   0.08215838 0.05700877  0.410938
minPct 0.70000000 0.40000000  0.050000
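
When the shape of the output is known in advance, vapply is a stricter alternative to sapply: it takes a template for the return value and stops with an error if any group returns something of a different length or type. A minimal sketch, assuming the question's data and function (the FUN.VALUE template is my own choice, not part of the original answer):

```r
landlines <- data.frame(
  year    = rep(c(1990, 1995, 2000, 2005, 2010), times = 3),
  country = rep(c("US", "Brazil", "Asia"), each = 5),
  pct     = c(0.99, 0.99, 0.98, 0.05, 0.9,
              0.4,  0.5,  0.55, 0.5,  0.45,
              0.7,  0.85, 0.9,  0.85, 0.75)
)

# someStats() as defined in the question.
someStats <- function(x) {
  dp   <- as.matrix(x$pct)  - mean(x$pct)
  indp <- as.matrix(x$year) - mean(x$year)
  f <- lm.fit(indp, dp)$coefficients
  results <- c(f, sd(x$pct), min(x$pct))
  names(results) <- c("coef", "sdev", "minPct")
  results
}

# FUN.VALUE declares that every group must yield three doubles.
tbl <- vapply(split(landlines, landlines$country), someStats,
              FUN.VALUE = c(coef = 0, sdev = 0, minPct = 0))
print(tbl)
```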

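Added a demonstration of building a first column from the values in the row names (tbl holds the sapply result shown above):

> tbl <- sapply( split(landlines, list(country=landlines$country)), FUN=someStats)
> ttbl <- as.data.frame(t(tbl))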
> ttbl <- cbind(Country=rownames(ttbl), ttbl)
> ttbl
       Country    coef       sdev minPct
Asia      Asia  0.0020 0.08215838   0.70
Brazil  Brazil  0.0020 0.05700877   0.40
US          US -0.0224 0.41093795   0.05

Answer 1
4 accepted 2012-04-04 14:58:53

Answer 2
4 2012-04-04 15:27:53

Answer 3
3 2012-04-04 15:11:44
