Conditional Seasonal Averaging Time-Series Data

Introduction

Summary:

  • Trying to average data by season (when necessary) when certain conditions are met.

Hello everyone.

I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time-series data collected for multiple variables across different locations. So far, I've extracted the data into separate CSV files per site and per station.

For the most part, each parameter is reported once per season.

Season here is defined ecologically as DJF, MAM, JJA, and SON, the month groups corresponding to Winter, Spring, Summer, and Fall respectively.

However, there are some cases where multiple readings were taken during a single season. Here, the parameter values and dates have to be averaged before further analysis can take place on these data sets.

To complicate things even further, some of the data is marked by a Greater Than or Less Than (GTLT) symbol. In these cases, values and dates are not averaged unless the recorded value is the same.


Data Example

Summary:

  • Code and tables show the requested changes in the data set.

So, for a data-driven example...

Here are a few rows from a data set.

Data.Example<-structure(list(
    Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
    End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
    GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value = c(10, 10, 20, 30, 15, 10)), 
    .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

This is ideally what I would like as output:

Data.Example.New<-structure(list(
    Station.ID.new = c(13402, 13402, 13402, 13402), 
    End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"), 
    Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"), 
    GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
    Value.new = c(10, 20, 22.5, 10)), 
    .Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")

Here, the following things are occurring:

  • For Enterococci measured on July 13 and Aug 13, there is a GTLT symbol, but Value for both == 10. So average the dates. New row is 7/28/2016 and Value 10.
  • While Enterococci on Aug 15 is within the same season as the other values, its GTLT-flagged Value is different (20 rather than 10), so it would only be averaged with other readings of 20 from the same season of the same year. In this case, since it is the only one where Value == 20, that row does not change and is carried over into the final data frame.
  • Alkalinity in July and August are in the same season, so average the dates (7/28/16) and Value (22.5) in a new row.
  • Alkalinity in October is in a different season, so keep the row.
  • All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.
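The date averaging described above can be checked with a short base R sketch. The helper name mean_date is hypothetical, and dropping the half day with floor() is an assumption made so that the result matches the 7/28/2016 in the desired output.

```r
# Hypothetical helper: average a vector of m/d/Y date strings.
mean_date <- function(x) {
  d <- as.Date(as.character(x), format = "%m/%d/%Y")
  # Dates are stored as day counts, so average the counts and
  # drop any half day (assumption: round down to a whole date).
  as.Date(floor(mean(as.numeric(d))), origin = "1970-01-01")
}

avg <- mean_date(c("7/13/2016", "8/13/2016"))
format(avg, "%m/%d/%Y")
```

The same helper applied to 8/13/2016 and 8/15/2016 gives 8/14/2016.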

If for some reason you have both GTLT and non-GTLT rows for the same parameter:

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/13/2016            10         Alk
8/15/2016            20         Alk

Then the final result would be:

End.Date    GTLT    Value    Parameter
7/13/2015     <      10         Alk
7/13/2016     <      10         Alk
8/14/2016            15         Alk

Approach

Summary:

  • Define seasons and then aggregate using a package like dplyr?
  • Create a loop function to read row by row (after sorting by Parameter.Name, then Date?)

As one might expect, this is where I'm stuck.

I know seasons can be defined in R from prior Stack Overflow questions:

New vector of seasons based on dates

And I know that averaging/aggregation packages such as dplyr (and possibly zoo?) support chained commands.

My issue is putting this thought process into code that can be repeated for each data set.

I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through the data set row by row after sorting by Parameter.Name and then End.Date.

I quickly sketched my thoughts on what some sort of loop function would have to include:

(Image: rough idea of flow diagram)

Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need to be averaged as well. Hence finding the row [i+n] that breaks the loop before the last step, averaging all prior rows (through [i+n-1]), and moving on to the next new row [i+n].

Further, as clarification, readings must fall within the same season of the same annual cycle to be averaged. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016 != 4/13/2015 in regards to averaging.
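One way to express that "same season of the same annual cycle" rule is a combined season-and-year key, sketched here in base R. The function name season_key is hypothetical, and counting December with the following January and February is an assumption taken from the 12/12/2015 == 01/01/2016 example.

```r
# Hypothetical helper: label each m/d/Y date with its DJF/MAM/JJA/SON
# season and the year of the seasonal cycle it belongs to.
season_key <- function(x) {
  d <- as.Date(as.character(x), format = "%m/%d/%Y")
  m <- as.integer(format(d, "%m"))
  y <- as.integer(format(d, "%Y"))
  lookup <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
              "JJA", "JJA", "SON", "SON", "SON", "DJF")
  # Assumption: December is counted with the next year's Jan and Feb.
  y[m == 12] <- y[m == 12] + 1L
  paste(lookup[m], y)
}

keys <- season_key(c("7/13/2016", "8/13/2016", "12/12/2015",
                     "01/01/2016", "4/13/2016", "4/13/2015"))
keys
```

Rows that share a key are candidates for averaging, so no row-by-row loop is needed to find where one season ends and the next begins.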


Conclusion and Summary

In short, I need help designing code to average individual parameter time-series values by annual season, with specific exceptions, for multiple large data sets.

I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.

Thank you in advance for your time.

Cheers,

soccernamlak

Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except that I did not keep the exact dates, which I felt were misleading in the final result.

Data.Example<-structure(list(
  Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402), 
  End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"), 
  Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
  GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"), 
  Value = c(10, 10, 20, 30, 15, 10)), 
  .Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")

library(dplyr)

# Create season key (month number -> three-month season label)
seasons <- data.frame(month = 1:12, season = c(rep("DJF", 2), rep("MAM", 3), rep("JJA", 3), rep("SON", 3), "DJF"), stringsAsFactors = FALSE)

# Isolate Month and Year, create Season column
Dates <- as.Date(as.character(Data.Example$End.Date), format = "%m/%d/%Y")
Data.Example$Month <- lubridate::month(Dates)
Data.Example$Year <- lubridate::year(Dates)
Data.Example$Season <- seasons$season[Data.Example$Month]

# Update 'Year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12] + 1

# Find out which station/year/season/parameters have at least one record with a GTLT
# (has_GTLT is 1 when any row in the group carries a one-character symbol, 0 otherwise)
GTLT.Test <- Data.Example %>%
  group_by(Station.ID, Year, Season, Parameter.Name) %>%
  summarize(has_GTLT = max(nchar(as.character(GTLT)))) %>%
  ungroup()

# First only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>%
  anti_join(GTLT.Test[GTLT.Test$has_GTLT == 1, ],
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
  summarize(Value.new = mean(Value)) %>%
  ungroup()

# Now do the same for groups with GTLT, only combining when values and GTLT symbols match.
Data.Example.New2 <- Data.Example %>%
  anti_join(GTLT.Test[GTLT.Test$has_GTLT == 0, ],
            by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
  group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
  summarize(Value.new = mean(Value)) %>%
  ungroup() %>%
  select(-Value)

# Combine both
Data.Example.New <- bind_rows(Data.Example.New1, Data.Example.New2)

EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not date, and does not use actual seasons. I did this because in your example, Dec. 12 matches with Jan. 1. December 12 is technically fall, so I assumed you weren't using actual seasons, but were instead using four three-month groupings.
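If the averaged dates are wanted after all, a variant along the following lines might work. This is only a sketch in base R, not the dplyr code above: the function name seasonal_average and the sub-key construction are hypothetical. It assumes the rule read from the question's tables: unflagged rows in a season are averaged together, while flagged rows merge only with rows that have the same symbol and value.

```r
# Hypothetical helper: collapse one data set to season-level rows while
# keeping an averaged End.Date for each output row.
seasonal_average <- function(df) {
  d <- as.Date(as.character(df$End.Date), format = "%m/%d/%Y")
  m <- as.integer(format(d, "%m"))
  y <- as.integer(format(d, "%Y"))
  y[m == 12] <- y[m == 12] + 1L  # December joins the next year's DJF
  season <- c("DJF", "DJF", rep(c("MAM", "JJA", "SON"), each = 3), "DJF")[m]
  gtlt <- as.character(df$GTLT)
  # Assumption: flagged rows merge only on identical symbol and value;
  # unflagged rows within a season all merge together.
  sub <- ifelse(gtlt == "", "plain", paste(gtlt, df$Value))
  key <- paste(df$Station.ID, df$Parameter.Name, y, season, sub, sep = "|")
  groups <- split(seq_len(nrow(df)), factor(key, levels = unique(key)))
  rows <- lapply(groups, function(i) {
    avg_day <- floor(mean(as.numeric(d[i])))  # drop any half day
    data.frame(
      Station.ID = df$Station.ID[i[1]],
      End.Date = format(as.Date(avg_day, origin = "1970-01-01"), "%m/%d/%Y"),
      Parameter.Name = as.character(df$Parameter.Name[i[1]]),
      GTLT = gtlt[i[1]],
      Value = mean(df$Value[i]),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Same six rows as Data.Example in the question, rebuilt as plain columns.
example <- data.frame(
  Station.ID = 13402,
  End.Date = c("7/13/2016", "8/13/2016", "8/15/2016",
               "7/13/2016", "8/13/2016", "10/13/2016"),
  Parameter.Name = rep(c("Enterococci", "Alkalinity"), each = 3),
  GTLT = rep(c("<", ""), each = 3),
  Value = c(10, 10, 20, 30, 15, 10),
  stringsAsFactors = FALSE
)
result <- seasonal_average(example)
result
```

On the six example rows this yields four rows with values 10, 20, 22.5, and 10, in the order of first appearance. Dates are printed with leading zeros (07/28/2016 rather than 7/28/2016). Comparing flagged values through paste() assumes they are recorded identically, so measurements with floating-point noise would need rounding first.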
