Usually I can find the answers to my questions by combining Google with a search on StackOverflow, but this time I am a little stuck. I have a data frame that contains values collected on certain dates across multiple years. Here is a sample:
"Date" "TEMP" "TEMP2"
"1" 2007-06-19 NA NA
"2" 2007-06-19 24.4 22
"3" 2007-06-19 24.4 22
"4" 2007-06-19 NA NA
"5" 2007-06-19 25.1 22.65
"6" 2007-06-19 NA NA
"7" 2007-06-19 25.29 22.94
"8" 2007-06-19 25.20 22.69
"9" 2007-06-19 NA NA
"10" 2007-06-19 25.20 22.69
"11" 2007-06-19 NA NA
"12" 2007-06-19 25.9 23.94
What I want to do is create a new data frame (let's call it df2) that has averages for TEMP and TEMP2 on a per-month basis. The output needs to be in this format, and the NAs need to be excluded:
"DateFrom" "DateTo" "avg_TEMP" "avg_TEMP2"
"1" 2007-06-1 2007-06-30 12.3 22.5
"2" 2007-07-1 2007-07-30 13.4 33.4
I have created a stub of df2 and populated it with DateFrom and DateTo for the ranges I am interested in. As a programmer I know I can solve this with a for loop, but I do not think that is the correct way to approach the problem in R. Based on my limited understanding of R and other answers I've reviewed, I need to use some form of the apply function on df2. I tried the code below, but it didn't work, and I think I am just misunderstanding how apply, etc., work.
get.daterange <- function(df, date1, date2) {
  df[df$Date >= date1 & df$Date <= date2,]
  tmp <- subset((df[df$Date >= date1 & df$Date <= date2,]), select=data.reportval)
  tmp.1 = colMeans(tmp, na.rm = TRUE)
  return(tmp.1)
}
df2 <- mapply(get.daterange, df1, df2$DateFrom, df2$DateTo)
The function probably has some problem... it works correctly if I give it two single values (so I could iterate across df2 with a for loop and use it), but it does not seem to do the correct thing when given data frame columns.
I would recommend using the lubridate and dplyr packages. lubridate helps with date-time manipulation and dplyr helps with data frame manipulation. Then you can use something like this:
library(dplyr)
library(lubridate)

df2 <- df %>%
  # derive the grouping keys from the Date column
  mutate(month = month(Date), year = year(Date)) %>%
  group_by(month, year) %>%
  # na.rm = TRUE drops the NA readings before averaging
  summarize(avg_TEMP = mean(TEMP, na.rm = TRUE),
            avg_TEMP2 = mean(TEMP2, na.rm = TRUE),
            std_TEMP = sd(TEMP, na.rm = TRUE),
            std_TEMP2 = sd(TEMP2, na.rm = TRUE),
            number_of_records = n(),
            non_NA_records = sum(!is.na(TEMP)))
So, the mutate() step adds month and year columns, group_by() groups the data frame by those columns, and summarize() calculates the average temperatures, standard deviations, and record counts for every group.
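If the output should carry DateFrom and DateTo columns as in the question, one possible variation is to group by the first and last day of each month instead of by month and year numbers. This is only a sketch: the sample data below is made up for illustration, and computing DateTo as "first of the month plus one month minus one day" is my assumption about what the asker wants.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data modeled on the question (values are illustrative)
df <- data.frame(
  Date  = as.Date(c("2007-06-19", "2007-06-19", "2007-06-20",
                    "2007-07-02", "2007-07-03")),
  TEMP  = c(NA, 24.4, 25.1, 26.0, NA),
  TEMP2 = c(NA, 22.0, 22.65, 23.5, 24.1)
)

df2 <- df %>%
  # first day of the month, then the last day of that same month
  mutate(DateFrom = floor_date(Date, "month"),
         DateTo   = DateFrom + months(1) - days(1)) %>%
  group_by(DateFrom, DateTo) %>%
  # na.rm = TRUE excludes the NA readings from each average
  summarize(avg_TEMP  = mean(TEMP, na.rm = TRUE),
            avg_TEMP2 = mean(TEMP2, na.rm = TRUE),
            .groups = "drop")

print(df2)
```

Grouping on the date columns themselves keeps the result to one row per month without a separate join back to a stub table.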
EDIT: added standard deviations and record count. You probably don't want to use apply() to work with data frames; dplyr is tailor-made for the purpose, and apply() will only give you headaches, as it is designed for matrices and arrays and coerces a data frame to a matrix before doing anything.
The book R for Data Science is very helpful for this sort of task.