簡體   English   中英

在R中的長data.frames上有效使用函數

[英]Efficient use of functions on long data.frames in R

我有一個長數據框,其中包含來自桅桿的氣象數據。 它包含在不同高度( data$z )的不同參數(風速,方向,氣溫等, data$param )的同時拍攝的觀測值( data$value

我試圖有效地將這些數據切片$time ,然后將函數應用於所有收集的數據。 通常,函數一次應用於單個$param (即,我對風速應用不同的函數而不是空氣溫度)。

目前的做法

我目前的方法是使用data.frameddply

如果我想獲得所有風​​速數據,我運行:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

然后我使用ddply()df上運行我的函數:

df.tav <- ddply(df,
               .(time),
               function(x) {
                      y <-data.frame(V1 = sum(x$value) + sum(x$z),
                                     V2 = sum(x$value) / sum(x$z))
                      return(y)
                    })

通常V1和V2是對其他功能的調用。 這些只是一些例子。 我確實需要在相同的數據上運行多個函數。

我目前的方法慢。 我沒有對它進行基准測試,但它足夠慢,我可以去喝咖啡,然后在一年的數據處理之前回來。

我有訂單(百)塔要處理,每個都有一年的數據和10-12個高度,所以我正在尋找更快的東西。

數據樣本

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

使用data.table

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000

如果你想為每個參數應用不同的函數,這里有一個更統一的方法。

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
                p     fn
1: wind direction <call>    # the fn column contains functions
2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

PS我認為我必須在eval之前以某種方式使用param才能使eval工作是一個錯誤。


更新:版本1.8.11開始,此錯誤已得到修復,以下工作:

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]

使用dplyr。 它仍在開發中,但它比plyr快得多:

# devtools::install_github(dplyr)
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

dplyr的另一個優點是你可以使用數據表作為后端,而無需了解data.table的特殊語法:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

(帶有數據幀的dplyr比plyr快20-100倍,帶有data.table的dplyr大約快10倍)。 dplyr遠不如data.table那么簡潔,但是它具有數據分析的每個主要任務的功能,我發現這使得代碼更容易理解 - 你幾乎能夠讀取一系列dplyr操作給別人和讓他們了解發生了什么。

如果您想對每個變量進行不同的匯總,我建議您將數據結構更改為“ 整潔 ”:

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy, 
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(`barometric pressure`, na.rm = TRUE)
)

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM