
R Language: Calculating an average across a subset of rows between certain dates and saving the results to a new data frame

Usually I can find the answers to my questions through a combination of Google and searching StackOverflow, but this time I am a little stuck. I have a data frame that contains values collected on certain dates across multiple years. Here is a sample:

        "Date"      "TEMP"  "TEMP2"
"1"     2007-06-19  NA      NA
"2"     2007-06-19  24.4    22
"3"     2007-06-19  24.4    22
"4"     2007-06-19  NA      NA
"5"     2007-06-19  25.1    22.65
"6"     2007-06-19  NA      NA
"7"     2007-06-19  25.29   22.94
"8"     2007-06-19  25.20   22.69
"9"     2007-06-19  NA      NA
"10"    2007-06-19  25.20   22.69
"11"    2007-06-19  NA      NA
"12"    2007-06-19  25.9    23.94

What I want to do is create a new data frame (let's call it df2) that has averages for TEMP and TEMP2 on a per-month basis. The output needs to be in this format, and the NAs need to be excluded:

        "DateFrom"      "DateTo"        "avg_TEMP"  "avg_TEMP2"
"1"     2007-06-1       2007-06-30      12.3        22.5
"2"     2007-07-1       2007-07-30      13.4        33.4

I have created a stub of df2 and populated it with DateFrom and DateTo for the ranges I am interested in. As a programmer I know I can solve this with a for loop, but I do not think that is the correct way to approach the problem in R. Based on my limited understanding of R and other answers I've reviewed, I need to use some form of the apply function on df2. I tried the code below, but it didn't work, and I think I am just misunderstanding how apply and its relatives work.

get.daterange <- function(df, date1, date2) {
  df[df$Date >= date1 & df$Date <= date2,]
  tmp <- subset((df[df$Date >= date1 & df$Date <= date2,]),select=data.reportval)
  tmp.1 = colMeans(tmp, na.rm = TRUE)
  return(tmp.1)
}

df2 <- mapply(get.daterange, df1, df2$DateFrom, df2$DateTo)

The function probably has some problem... it works correctly if I give it two single values (so I could iterate across df2 with a for loop and use it), but does not seem to do the correct thing if given dataframe columns.

I would recommend using the lubridate and dplyr packages. lubridate helps with date-time manipulation and dplyr helps with data frame manipulation. Then you can use something like this:

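library(dplyr)      # data frame manipulation and the %>% pipe
library(lubridate)  # month() and year()

df2 <- df %>%
  # derive the grouping columns from the Date column
  mutate(month = month(Date), year = year(Date)) %>%
  group_by(month, year) %>%
  # na.rm = TRUE excludes the NA values from each statistic
  summarize(avg_TEMP = mean(TEMP, na.rm = TRUE),
            avg_TEMP2 = mean(TEMP2, na.rm = TRUE),
            std_TEMP = sd(TEMP, na.rm = TRUE),
            std_TEMP2 = sd(TEMP2, na.rm = TRUE),
            number_of_records = n(),
            non_NA_records = sum(!is.na(TEMP)))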

So, the mutate() step adds month and year columns, the group_by() step groups the data frame by those columns, and the summarize() step calculates the average temperatures for every group.
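The question asks for DateFrom and DateTo columns rather than month and year. One way to get that shape is to group on the first day of each month and derive the last day afterwards. The following is only a sketch: the sample data is made up for illustration, and it assumes Date is already of class Date.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data using the question's column names
df <- data.frame(
  Date  = as.Date(c("2007-06-19", "2007-06-19", "2007-06-20",
                    "2007-07-02", "2007-07-15")),
  TEMP  = c(NA, 24.4, 25.1, 26.0, NA),
  TEMP2 = c(NA, 22.0, 22.65, 23.5, 24.5)
)

df2 <- df %>%
  # first day of the month each row falls in
  mutate(DateFrom = floor_date(Date, "month")) %>%
  group_by(DateFrom) %>%
  summarize(avg_TEMP  = mean(TEMP, na.rm = TRUE),
            avg_TEMP2 = mean(TEMP2, na.rm = TRUE)) %>%
  # last day of the month: first day of the next month minus one day
  mutate(DateTo = DateFrom + months(1) - days(1)) %>%
  select(DateFrom, DateTo, avg_TEMP, avg_TEMP2)

print(df2)
```

Grouping on a single date column also keeps months from different years apart without needing a separate year column.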

EDIT: added standard deviations and record counts. You probably don't want to use apply() to work with data frames; dplyr is tailor-made for the purpose, whereas apply() is designed for matrices and arrays and coerces a data frame to a matrix first, which will only give you headaches.
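If you would rather keep the pre-built df2 of date ranges, the mapply() call in the question can probably be repaired. As written, df1 is passed as one of the vectorized arguments, so mapply() walks over its columns in parallel with the dates instead of handing the whole data frame to each call. Passing the data frame through MoreArgs keeps it fixed. This is a sketch with made-up sample data, and it assumes the columns to average are TEMP and TEMP2:

```r
# Hypothetical sample data using the question's column names
df1 <- data.frame(
  Date  = as.Date(c("2007-06-19", "2007-06-19", "2007-06-20",
                    "2007-07-02", "2007-07-15")),
  TEMP  = c(NA, 24.4, 25.1, 26.0, NA),
  TEMP2 = c(NA, 22.0, 22.65, 23.5, 24.5)
)

# Stub of df2 holding the date ranges of interest
df2 <- data.frame(
  DateFrom = as.Date(c("2007-06-01", "2007-07-01")),
  DateTo   = as.Date(c("2007-06-30", "2007-07-31"))
)

# Column means of TEMP and TEMP2 for rows within [date1, date2]
get.daterange <- function(date1, date2, df) {
  rows <- df$Date >= date1 & df$Date <= date2
  colMeans(df[rows, c("TEMP", "TEMP2")], na.rm = TRUE)
}

# Only the two date vectors are iterated over; df1 is passed whole
avgs <- mapply(get.daterange, df2$DateFrom, df2$DateTo,
               MoreArgs = list(df = df1))

# avgs is a matrix with one row per variable and one column per range
result <- cbind(df2,
                avg_TEMP  = avgs["TEMP", ],
                avg_TEMP2 = avgs["TEMP2", ])
print(result)
```

This still filters df1 once per date range, so for many ranges the grouped dplyr version is likely the simpler choice.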

The book R for Data Science is very helpful for this sort of task.
