
Efficient use of functions on long data.frames in R

I have a long data frame that contains meteorological data from a mast. It contains observations (data$value) taken at the same time for different parameters (wind speed, direction, air temperature, etc., in data$param) at different heights (data$z).

I am trying to efficiently slice this data by $time and then apply functions to all of the data collected. Usually functions are applied to a single $param at a time (i.e. I apply different functions to wind speed than I do to air temperature).

Current approach

My current method is to use data.frame and ddply.

If I want to get all of the wind speed data, I run this:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

I then run my function on df using ddply():

df.tav <- ddply(df,
                .(time),
                function(x) {
                  y <- data.frame(V1 = sum(x$value) + sum(x$z),
                                  V2 = sum(x$value) / sum(x$z))
                  return(y)
                })

Usually V1 and V2 are calls to other functions. These are just examples. I do need to run multiple functions on the same data, though.

Question

My current approach is very slow. I have not benchmarked it, but it's slow enough that I can go get a coffee and come back before a year's worth of data has been processed.

I have on the order of a hundred towers to process, each with a year of data at 10-12 heights, so I am looking for something faster.

Data sample

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000

If you want to apply a different function for each param, here's a more uniform approach for that.

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
                p     fn
1: wind direction <call>    # the fn column contains functions
2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02
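
Extending the lookup table to more parameters is just a matter of adding more rows of quoted calls -- a sketch (the temperature expression here is made up purely for illustration):

```r
fns = data.table(p = c('wind direction', 'wind speed', 'temperature'),
                 fn = c(quote(sum(value) + sum(z)),   # per-param expressions,
                        quote(sum(value) / sum(z)),   # evaluated within .SD
                        quote(mean(value))),          # hypothetical example
                 key = 'p')
```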

PS: I think the fact that I have to use param in some way before eval for eval to work is a bug.


UPDATE: As of version 1.8.11 this bug has been fixed and the following works:

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
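
Since the question mentions on the order of a hundred towers, a minimal sketch of scaling this up per file (the directory and column layout are assumptions -- one CSV of mast data per tower, with the same time/z/param/value columns):

```r
library(data.table)

# Hypothetical layout: one CSV per tower under towers/
files <- list.files("towers", pattern = "\\.csv$", full.names = TRUE)

process_tower <- function(f) {
  dt <- fread(f)                          # fast CSV reader from data.table
  dt[!is.na(value),
     list(V1 = sum(value) + sum(z),
          V2 = sum(value) / sum(z)),
     by = list(param, time)][, tower := basename(f)]
}

# Bind the per-tower summaries into one data.table
result <- rbindlist(lapply(files, process_tower))
```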

Use dplyr. It's still in development, but it's much, much faster than plyr:

# devtools::install_github("hadley/dplyr")
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

The other advantage of dplyr is that you can use a data table as a backend, without having to know anything about data.table's special syntax:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

(dplyr with a data frame is 20-100x faster than plyr, and dplyr with a data.table is about another 10x faster.) dplyr is nowhere near as concise as data.table, but it has a function for each major task of data analysis, which I find makes the code easier to understand - you should almost be able to read a sequence of dplyr operations to someone else and have them understand what's going on.
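
To put rough numbers on that claim yourself, a benchmarking sketch using the microbenchmark package (timings depend on data size and hardware, so treat the speedup factors above as ballpark figures):

```r
library(microbenchmark)

windspeed <- subset(data, param == "wind speed")

# Compare the plyr split-apply-combine against the dplyr equivalent
microbenchmark(
  plyr  = plyr::ddply(windspeed, plyr::.(time), function(x)
            data.frame(V1 = sum(x$value) + sum(x$z),
                       V2 = sum(x$value) / sum(x$z))),
  dplyr = dplyr::summarise(dplyr::group_by(windspeed, time),
                           V1 = sum(value) + sum(z),
                           V2 = sum(value) / sum(z)),
  times = 100
)
```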

If you want to do different summaries per variable, I recommend changing your data structure to be "tidy":

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy,
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
