简体   繁体   中英

R: calculating mean and sd for each week of the year

I have 4 data frames, each data frame corresponding to a single year. Each data frame contains daily rainfall for five locations.

Generate sample data

    location <- c("A","B","C","D","E")
    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits=2)
    dat.1981 <-as.data.frame(cbind(location,mat)) # rainfall for 1981
    dat.1981$year <- 1981

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1982 <- as.data.frame(cbind(location,mat)) # rainfall for 1982
    dat.1982$year <- 1982

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1983 <-as.data.frame(cbind(location,mat)) # rainfall for 1983
    dat.1983$year <- 1983

    mat <- round(as.data.frame(matrix(runif(1825),nrow=5,ncol=365)), digits = 2)
    dat.1984 <-as.data.frame(cbind(location,mat)) # rainfall for 1984
    dat.1984$year <- 1984

    dat <- as.data.frame(rbind(dat.1981,dat.1982,dat.1983,dat.1984))

For each year, I want to classify whether a day was an extreme wet day or not

Here's how I do my calculation:

1) For each location, generate the mean and sd of rainfall for every week for the period 1981 to 1984. For example, in location A the mean rainfall for the first week will be:

(First week rain 1981 in A + First week rain 1982 in A + First week rain 1983 in A + First week rain 1984 in A)/4

which can be written in R as

    mean.week1.loc1 <- mean(rowSums(dat[dat$location=="A",2:8])) # 2:8 selects the first 7 days in each year
    sd.week1.loc1 <- sd(rowSums(dat[dat$location=="A",2:8])) 

    wet.cr <- mean.week1 + sd.week1 # this is my threshold for defining a wet day

If daily rainfall in week 1 for the years 1981 to 1984 in location A is greater than wet.cr , that day is a wet day and hence gets a value of 1

As an example, to examine whether rainfall of week 1 for location A for 1981 to 1984 is a wet day or not I can do the following:

   lapply(dat[, 2:8], function(x) ifelse(x > wet.cr, 1, 0))

I want to repeat this for each week and each location.

However, I am unable to stitch these individual functions together and also my final results should be a dataframe same as dat but instead of rainfall values, I will have 1 or 0 defining whether it is a wet day or not.

EDIT

The solutions below gives me the following:

mean(c(rainfall 1981 day 1 week 1, ...., rainfall 1981 day 7 week 1, rainfall 1982 day 1 week 1,....,rainfall 1982 day 7 week 1,....,rainfall 1984 day 1 week 1,....,rainfall 1984 day 7 week 1))

WHAT I WANT IS:

mean(c(mean(total rainfall week 1 1981), mean(total rainfall week 1 1982), mean(total rainfall week 1 1983), mean(total rainfall week 1 1984)))

I hope this is clear now.

A tidyverse solution

    library(magrittr)
    library(tidyverse)

    dat_m <- gather(dat, day, rainfall, -location, -year)
    str(dat_m)

    dat_m %<>%
      mutate(day = gsub("V", "", day)) %>%
      mutate(day = as.numeric(day)) %>% 
      mutate(week = as.integer(ceiling(day/7))) %>% 
      group_by(location, week) %>% 
      mutate(wet.cr = mean(rainfall, na.rm = TRUE) + sd(rainfall, na.rm = TRUE) ) %>% 
      mutate(indication = ifelse(rainfall > wet.cr, 1, 0)) %>% 
      ungroup()
    dat_m 

    # A tibble: 7,300 x 7
       location  year   day rainfall  week wet.cr indication
       <fctr>   <dbl> <dbl>    <dbl> <int>  <dbl>      <dbl>
     1 A         1981  1.00    0.880     1  0.845       1.00
     2 B         1981  1.00    0.850     1  0.829       1.00
     3 C         1981  1.00    1.00      1  0.877       1.00
     4 D         1981  1.00    0.100     1  0.755       0   
     5 E         1981  1.00    0.190     1  0.750       0   
     6 A         1982  1.00    0.380     1  0.845       0   
     7 B         1982  1.00    0.760     1  0.829       0   
     8 C         1982  1.00    0.940     1  0.877       1.00
     9 D         1982  1.00    0.900     1  0.755       1.00
    10 E         1982  1.00    0.600     1  0.750       0   
    # ... with 7,290 more rows

Edit : For rainfall, I think it's better to use sum (total) than mean

So we first calculate the total weekly rainfall for every year. Then we calculate the long-term average & stdev of the total weekly rainfall.

    dat_m %<>%
      mutate(day = as.numeric(gsub("V", "", day)),
             week = as.integer(ceiling(day/7))) %>%
      group_by(location, week, year) %>% 
      mutate(total_weekly_rainfall = sum(rainfall, na.rm = TRUE)) %>% 
      ungroup() %>% 
      group_by(location, week) %>% 
      mutate(mean_weekly_rainfall = sum(rainfall, na.rm = TRUE)/length(unique(year)),
             stddev_weekly_rainfall = sd(rainfall, na.rm = TRUE),
             wet.cr =  mean_weekly_rainfall + stddev_weekly_rainfall,
             indication = ifelse(total_weekly_rainfall > wet.cr, 1, 0)) %>% 
      arrange(location, year, day) %>% 
      ungroup() %>% 
      distinct(location, year, week, .keep_all = TRUE)
    dat_m 

    # A tibble: 1,060 x 10
       location  year   day rainfall  week total_wee~ mean_wee~ stddev_w~ wet.~ indic~
       <fctr>   <dbl> <dbl>    <dbl> <int>      <dbl>     <dbl>     <dbl> <dbl>  <dbl>
     1 A         1981  1.00   0.880      1     0.880      0.630     0.277 0.907      0
     2 A         1981  8.00   0.190      2     0.190      0.330     0.431 0.761      0
     3 A         1981 15.0    0.630      3     0.630      0.548     0.331 0.878      0
     4 A         1981 22.0    0.0300     4     0.0300     0.290     0.259 0.549      0
     5 A         1981 29.0    0.360      5     0.360      0.308     0.196 0.504      0
     6 A         1981 36.0    0.540      6     0.540      0.500     0.225 0.725      0
     7 A         1981 43.0    0.0300     7     0.0300     0.375     0.289 0.664      0
     8 A         1981 50.0    0.170      8     0.170      0.332     0.375 0.708      0
     9 A         1981 57.0    0.260      9     0.260      0.652     0.272 0.924      0
    10 A         1981 64.0    0.590     10     0.590      0.512     0.202 0.715      0
    # ... with 1,050 more rows

using data.table :

library(data.table)
dat <- setDT(dat)
newdat <- melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")
newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
newdat[,Week := day %/% 7]
newdat[,threshold := mean(rain) + sd(rain),  by = .(location,Week)]
newdat[,wet := ifelse(rain > threshold,1,0)]
print(newdat,topn = 100)


      location year day rain Week threshold wet
   1:        A 1981   1 0.73    0 0.7630065   0
   2:        B 1981   1 0.69    0 0.8599243   0
   3:        C 1981   1 0.45    0 0.8145956   0
   4:        D 1981   1 0.51    0 0.8935058   0
   5:        E 1981   1 0.77    0 0.6992752   1
   6:        A 1982   1 0.47    0 0.7630065   0
   7:        B 1982   1 0.70    0 0.8599243   0
   8:        C 1982   1 0.48    0 0.8145956   0
   9:        D 1982   1 0.92    0 0.8935058   1

a step by step explanation: first you need to change format of your data to ease the calculus. The long format is more appropriate, as each column V## is actually a variable that is the number day. This is done using melt

melt(dat, measure.vars = patterns("^V"),variable.name = "day",value.name = "rain")

     location year  day rain
   1:        A 1981   V1 0.73
   2:        B 1981   V1 0.69
   3:        C 1981   V1 0.45
   4:        D 1981   V1 0.51
   5:        E 1981   V1 0.77
  ---                        
7296:        A 1984 V365 0.31
7297:        B 1984 V365 0.99
7298:        C 1984 V365 0.25
7299:        D 1984 V365 0.24
7300:        E 1984 V365 0.87

Then you tranform your day to a real number, in order to be able to calculate the week

newdat[,day := as.character(day)]
newdat[,day := as.numeric(unlist(lapply(newdat$day,function(x){strsplit(x,"V")[[1]][2]})))]
> newdat[,.(day,year)]
      day year
   1:   1 1981
   2:   1 1981
   3:   1 1981
   4:   1 1981
   5:   1 1981

Then calculate the week number the same as you do

newdat[,Week := day %/% 7]

The statistic for the calculus of the sthreshold is done by grouping on the weeks and places (so statistic over the year for each place)

newdat[,threshold := mean(rain) + sd(rain), by = .(location,Week)]

and define your wet day as the day when rain is higher than the threshold

newdat[,wet := ifelse(rain > threshold,1,0)]

but I agree on the comment that the initial data was surely in a better format than what you are presenting.

For both the data.table and the tidyverse solutions you may be well served to treat this as a scaling exercise (az score in many disciplines), since the mean + n standard deviation is a well known benchmark.

For the data.table solution you would:

newdat[,zrain := scale(rain),  by = .(location,Week)]
newdat[,zwet := ifelse(zrain > 1.0,1,0)]

where you rely on scale from base and compare to 1.0

For the tidyverse that becomes:

mutate(zrain = scale(rainfall)) %>% 
mutate(indication = ifelse(zrain > 1.0, 1, 0)) %>% 

That way in the future if your standard for what "wet" is changes you only have to change one number in your code

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM