Ensuring temporal data density in R

ISSUE ---------

I have thousands of time series files (.csv) containing intermittent data that span between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data are hourly, and where no measurement exists there is an 'NA'.

>df
date_time         temp 
01/05/1943 11:00  5.2
01/05/1943 12:00  5.2
01/05/1943 13:00  5.8
01/05/1943 14:00   NA
01/05/1943 15:00   NA
01/05/1943 16:00  5.8
01/05/1943 17:00  5.8
01/05/1943 18:00  6.3

I need to check these files to see whether they have sufficient data density, i.e. that the ratio of NAs to data values is not too high. To do this I have 3 criteria that must be checked for each file:

  1. Ensure that no more than 10% of the hours in a day are NAs
  2. Ensure that no more than 10% of the days in a month are NAs
  3. Ensure that there are 3 continuous years of data with valid days and months.

Each criterion must be fulfilled sequentially, and if a file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.

QUESTION --------

I wanted to ask the community how to go about this. I have considered using nested if loops, along with sqldf, plyr, aggregate or even dplyr, but I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.

I think this will work for you. These functions slide a window over the hourly data and, starting at each hour, count the NAs in the following day (24 hours), month (720 hours, i.e. 30 days) or 3-year period (26,280 hours, i.e. 3 x 365 days). Not tested, because I don't care to make up data to test it. Each function should return the number of NAs in each such window. So if checkdays returns a value greater than 2.4, then according to your 10% rule you'd have a problem. For months the limit is 72, and for 3-year periods you're hoping for values less than 2628. Again, please check these functions. By the way, the functions assume your data (with its NAs) is in column 2. Cheers.

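# Count NAs in every 24-hour window; column 2 holds the measurements.
# Assumes the data has at least as many rows as the window is long.
checkdays <- function(data){
  countNA <- NULL                        # grown inside the loop
  for(i in 1:(length(data[,2])-23)){
    nadata <- data[i:(i+23),2]           # 24 consecutive hourly readings
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}

# Count NAs in every 720-hour (30-day) window.
checkmonth <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2])-719)){
    nadata <- data[i:(i+719),2]
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}

# Count NAs in every 26,280-hour (3 x 365-day) window.
check3years <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2])-26279)){
    nadata <- data[i:(i+26279),2]
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}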
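
The 10% thresholds mentioned above (2.4, 72 and 2628) are simply 10% of each window length, so the check can be written once for any window. The following is a minimal sketch rather than part of the original answer: `count_na_window` and `exceeds_na_limit` are hypothetical helper names, and the 16 padding values appended to the sample from the question are made up.

```r
# Hypothetical helper: count NAs in every run of `width` consecutive readings.
count_na_window <- function(x, width) {
  n_windows <- length(x) - width + 1
  if (n_windows < 1) return(integer(0))    # series shorter than one window
  vapply(seq_len(n_windows),
         function(i) sum(is.na(x[i:(i + width - 1)])),
         integer(1))
}

# TRUE when any window has more than `max_frac` of its readings missing.
exceeds_na_limit <- function(x, width, max_frac = 0.10) {
  any(count_na_window(x, width) > max_frac * width)
}

# The 8 readings from the question, padded to 24 hours with made-up values.
temp <- c(5.2, 5.2, 5.8, NA, NA, 5.8, 5.8, 6.3, rep(6.0, 16))
day_counts <- count_na_window(temp, 24)    # one window containing 2 NAs
day_fails  <- exceeds_na_limit(temp, 24)   # 2 is within the 2.4 limit
```

A file would go on the list of failures when `exceeds_na_limit()` is TRUE for any of the three window lengths (24, 720 or 26280).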

So I ended up testing these, and they work for me. Here are system times for a dataset a year long, so I don't think you'll have problems.

> system.time(checkdays(RM_W1))
   user  system elapsed 
   0.38    0.00    0.37 
> system.time(checkmonth(RM_W1))
   user  system elapsed 
   0.62    0.00    0.62

Optimization: I took the time to run these functions with the data you posted above and it wasn't good. For loops are dangerous because they work well for small data sets but, if they're not constructed properly, slow down dramatically as datasets get larger. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post, "Speed up the loop operation in R", I rewrote the functions to be much faster. By minimising the amount of work that happens in the loop and pre-allocating memory you can really speed things up. You now need to call the function like checkdays(df[,2]), but it's faster this way.

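# Takes the measurement vector itself, e.g. checkdays(df[,2]).
checkdays <- function(data){
  countNA <- numeric(length(data)-23)    # pre-allocate the result
  for(i in 1:(length(data)-23)){
    nadata <- data[i:(i+23)]             # 24 consecutive hourly readings
    countNA[i] <- sum(is.na(nadata))
  }
  return(countNA)
}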
> system.time(checkdays(df[,2]))
   user  system elapsed 
   4.41    0.00    4.41 
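
If the loop is still too slow for the 3-year window, the same rolling counts can be obtained without a loop from a cumulative sum of the NA indicator. This is an alternative sketch, not the answerer's code; `rolling_na_count` is a hypothetical name, and `width` is the window length in hours.

```r
# Rolling NA count via cumulative sums (no explicit loop).
rolling_na_count <- function(x, width) {
  n <- length(x)
  if (n < width) return(integer(0))        # series shorter than one window
  cs <- c(0L, cumsum(is.na(x)))            # cs[k + 1] = number of NAs in x[1..k]
  # NAs in x[i..(i + width - 1)] = cs[i + width] - cs[i]
  cs[(width + 1):(n + 1)] - cs[1:(n - width + 1)]
}

# Small made-up example with a 2-hour window.
example_counts <- rolling_na_count(c(1, NA, NA, 1, NA), 2)
```

Each element of the result is the number of NAs in the window starting at that position, so it can be compared against the same thresholds as before.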

I believe this should be sufficient for your needs. As for leap years, you should be able to modify the optimized function as I mentioned in the comments. However, make sure you pass the leap-year data as a second dataset rather than as a second column.
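
Another way to deal with leap years and unequal month lengths is to group by actual calendar days and months instead of fixed-length windows. The sketch below is one possible reading of the three criteria, not the answerer's method. It assumes a data frame with a `date_time` column in dd/mm/YYYY HH:MM format and a `temp` column, with one row for every hour (as in the question), and it treats all times as UTC. The sample data and `has_three_year_run` are made up for illustration.

```r
# Made-up sample: 72 hourly rows spanning 28 Feb to 1 Mar 1944 (a leap year).
df <- data.frame(
  date_time = format(seq(as.POSIXct("1944-02-28 00:00", tz = "UTC"),
                         by = "hour", length.out = 72), "%d/%m/%Y %H:%M"),
  temp = 5
)
df$temp[c(4, 5, 30, 31, 32)] <- NA   # 2 gaps on 28 Feb, 3 gaps on 29 Feb

stamp <- as.POSIXct(df$date_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
day   <- format(stamp, "%Y-%m-%d")

# Criterion 1: a day is valid when at most 10% of its hours are NA.
day_na_frac <- tapply(is.na(df$temp), day, mean)
day_valid   <- as.vector(day_na_frac <= 0.10)
names(day_valid) <- names(day_na_frac)

# Criterion 2: a month is valid when at most 10% of its days are invalid.
month_bad_frac <- tapply(!day_valid, substr(names(day_valid), 1, 7), mean)
month_valid    <- as.vector(month_bad_frac <= 0.10)
names(month_valid) <- names(month_bad_frac)

# Criterion 3 (assumed reading): a year is valid when it has 12 valid months,
# and the file passes when at least 3 valid years are consecutive.
months_ok   <- tapply(month_valid, substr(names(month_valid), 1, 4), sum)
valid_years <- as.integer(names(months_ok)[as.vector(months_ok) == 12])

has_three_year_run <- function(years) {
  yrs <- sort(unique(years))
  if (length(yrs) == 0) return(FALSE)
  run_id <- cumsum(c(1, diff(yrs) != 1))   # new id whenever a year is skipped
  any(table(run_id) >= 3)
}

file_passes <- has_three_year_run(valid_years)
```

Running this per file and collecting the names of files where `file_passes` is FALSE gives the list of failures. Note that days or months missing from a file entirely are not counted as invalid here, which is why the one-row-per-hour assumption matters.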
