Data organization precipitation

Question

I have a precipitation database structured as follows.

Station; YEAR; MONTH; DAY 01; DAY 02; DAY 03 ..... DAY 31

data here

At first I wanted to calculate the accumulated precipitation in each month (I did it using the precintcon package), but only for one station. Now I want to do the same thing for every station separately, so that I have the daily and monthly values for each station. I also want to change the structure of the database so that the first column is the date and each of the other columns is a station.

Date; station1; station2; station3 ....... stationN

01/01/1994;30;10;5;6
01/02/1994;10;12;55
.
.
.
.
.
.
.
31/07/2018

Answer 1

First, as your dataset is pretty heavy (I only ran the code on a portion of it), you can read it with the fread function from data.table (I converted your xlsx file to a csv file).

library(data.table)
df <- fread("../Dados_precipitacao.csv", skip = 2, header = TRUE)

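Then, you can reshape your dataframe into a long format using the melt function from data.table:

library(data.table)
# Names of the daily columns ("dia 1" ... "dia 31")
colonne <- grep("dia", colnames(df), value = TRUE)
# One row per station, year, month and day; the rainfall goes into column DIA
dt.m <- melt(df, measure.vars = colonne, value.name = "DIA")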

Now, you have six columns:

   Município/Posto  Bacia  Ano Mês variable DIA
1:     Agua Branca Piancó 1994   1    dia 1   0
2:     Agua Branca Piancó 1994   2    dia 1   0
3:     Agua Branca Piancó 1994   3    dia 1  20
4:     Agua Branca Piancó 1994   4    dia 1   0
5:     Agua Branca Piancó 1994   5    dia 1   0
6:     Agua Branca Piancó 1994   6    dia 1   0

Now, still using data.table, we can create a date column by pasting together Ano, Mês and Day (Day is the variable column with "dia " removed from the string). Then, we use the ymd function from the lubridate package to convert this character string to a date format:

library(data.table)
library(lubridate)   # provides ymd()
test <- dt.m[1:1000,]   # work on a subset only
test[, Day:=gsub("dia ","",variable)]
test[, Date := do.call(paste, c(.SD, sep = "-")), .SDcols = c("Ano","Mês","Day")]
test[, Date:= ymd(Date)]

      Município/Posto      Bacia  Ano Mês variable DIA Day       Date
   1:     Agua Branca     Piancó 1994   1    dia 1   0   1 1994-01-01
   2:     Agua Branca     Piancó 1994   2    dia 1   0   1 1994-02-01
   3:     Agua Branca     Piancó 1994   3    dia 1  20   1 1994-03-01
   4:     Agua Branca     Piancó 1994   4    dia 1   0   1 1994-04-01
   5:     Agua Branca     Piancó 1994   5    dia 1   0   1 1994-05-01
  ---                                                                
 996:     Alagoa Nova Mamanguape 2003   8    dia 1   0   1 2003-08-01
 997:     Alagoa Nova Mamanguape 2003   9    dia 1   0   1 2003-09-01
 998:     Alagoa Nova Mamanguape 2003  10    dia 1   0   1 2003-10-01
 999:     Alagoa Nova Mamanguape 2003  11    dia 1   0   1 2003-11-01
1000:     Alagoa Nova Mamanguape 2003  12    dia 1   0   1 2003-12-01
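Because every month is melted into 31 day columns, the table also contains combinations that do not exist in the calendar, such as 30 February or 29 February in a non-leap year. The following is a minimal sketch of one way to drop them, not part of the original code: it assumes base as.Date() with an explicit format in place of ymd(), and the table toy, its values and the unaccented column name Mes are made up for illustration.

```r
library(data.table)

# Hypothetical toy table: February of a common year (1995) and of a leap
# year (1996), with days 28 to 31 as they would come out of melt()
toy <- data.table(
  Ano = rep(c(1995, 1996), each = 4),
  Mes = 2,
  Day = rep(28:31, times = 2),
  DIA = c(1.2, NA, NA, NA, 0, 3.4, NA, NA)
)

# With an explicit format, as.Date() returns NA for impossible dates
toy[, Date := as.Date(paste(Ano, Mes, Day, sep = "-"), format = "%Y-%m-%d")]

# Keep only the rows whose date exists in the calendar
valid <- toy[!is.na(Date)]
print(valid)
```

In this toy case 29 February survives only for 1996, so leap years need no special treatment.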

Now, we can use the dcast function from data.table to pivot the data.table into a wider format and create one column for each station (here I used Município/Posto):

library(data.table)
t <- dcast(test, ... ~ `Município/Posto`, value.var = "DIA")

          Bacia  Ano Mês variable Day       Date Agua Branca Aguiar Alagoa Grande Alagoa Nova
  1: Mamanguape 1994   1    dia 1   1 1994-01-01          NA     NA             0           0
  2: Mamanguape 1994   2    dia 1   1 1994-02-01          NA     NA             0           0
  3: Mamanguape 1994   3    dia 1   1 1994-03-01          NA     NA             0           0
  4: Mamanguape 1994   4    dia 1   1 1994-04-01          NA     NA             0           0
  5: Mamanguape 1994   5    dia 1   1 1994-05-01          NA     NA             0           0
 ---                                                                                         
584:     Piancó 2018   3    dia 1   1 2018-03-01         5.4      0            NA          NA
585:     Piancó 2018   4    dia 1   1 2018-04-01        12.6      0            NA          NA
586:     Piancó 2018   5    dia 1   1 2018-05-01        15.8     NA            NA          NA
587:     Piancó 2018   6    dia 1   1 2018-06-01         0.0     NA            NA          NA
588:     Piancó 2018   7    dia 1   1 2018-07-01         0.0     NA            NA          NA

I hope this is what you are looking for.

BTW: It will make things easier for everyone if you post a reproducible example of your data instead of a link to your full dataset (which is pretty heavy). To learn how to write one, see: How to make a great R reproducible example

Answer 2

This task requires some reshaping of the dataset: first make it longer and then wider again. dc37's answer already describes how to do that with data.table. I'd recommend a slightly different approach, using only tidyverse functions.

You state that you want to calculate the sum of the rainfall per month at each station. For that task it is actually easier to keep the data in a long format instead of making it wide again. I'll demonstrate both options (2a and 2b) below.

I would also recommend not merging the date variables, because that makes it harder to group the data by month. As an alternative to my approach, you could merge year and month only, which would still allow for the necessary grouping. Anyway, 2a) demonstrates how to use tidyr::unite() to merge the date variables.
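The "merge year and month only" alternative could look roughly like the sketch below. It assumes the data is already in the long layout built in step 1; the toy values and the names station, Mes (standing in for Mês) and ano_mes are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in long format: one row per station, year, month, day
toy_long <- tibble(
  station = c("A", "A", "A", "B", "B", "B"),
  Ano     = 1994,
  Mes     = c(1, 1, 2, 1, 1, 2),
  dia     = c(1, 2, 1, 1, 2, 1),
  value   = c(0, 8.6, 2, 1, 0, 5)
)

monthly <- toy_long %>%
  unite(ano_mes, Ano, Mes, sep = "-") %>%   # one key per month, e.g. "1994-1"
  group_by(station, ano_mes) %>%
  summarise(rainfall_per_month = sum(value), .groups = "drop")

print(monthly)
```

The day stays in its own column, so grouping by month needs only the combined key.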

1) Convert dataset to long format

  library(tidyverse)
  library(readxl)
  rainfall_df <- read_excel("Dados_precipitacao.xls", skip = 2)

  rainfall_long_df <-
    rainfall_df %>%
    select(-Bacia) %>%
    pivot_longer(`dia 1`:`dia 31`, names_to = "dia") %>%
    mutate(dia = gsub("dia ", "", dia))

rainfall_long_df looks like this:

 # A tibble: 1,931,889 x 5
   `Município/Posto`   Ano   Mês dia   value
   <chr>             <dbl> <dbl> <chr> <dbl>
 1 Agua Branca        1994     1 1       0  
 2 Agua Branca        1994     1 2       0  
 3 Agua Branca        1994     1 3       0  
 4 Agua Branca        1994     1 4       0  
 5 Agua Branca        1994     1 5       0  
 6 Agua Branca        1994     1 6       8.6
 7 Agua Branca        1994     1 7       0  
 8 Agua Branca        1994     1 8       2  
 9 Agua Branca        1994     1 9       0  
10 Agua Branca        1994     1 10      0  
# … with 1,931,879 more rows

2a) This is what you asked for: calculating the sums per month and station from a wide dataset.

rainfall_wide_df <-   
  rainfall_long_df %>%
    unite(data, dia, Mês, Ano, sep = "/", remove = FALSE) %>%
    pivot_wider(names_from = `Município/Posto`)

rainfall_wide_df %>% 
    group_by(Ano, Mês) %>% 
    summarise_at(vars(`Agua Branca`:`Zabelê`), sum)

This results in:

# A tibble: 296 x 253
# Groups:   Ano [26]
     Ano   Mês `Agua Branca` Aguiar `Alagoa Grande` `Alagoa Nova` Alagoinha Alcantil `Algodão de Jan…
   <dbl> <dbl>         <dbl>  <dbl>           <dbl>         <dbl>     <dbl>    <dbl>            <dbl>
 1  1994     1         174.   442.            101            68.5      64.6       NA             NA  
 2  1994     2          NA     NA              NA            NA        NA         NA             NA  
 3  1994     3         285.   120.            239.          210.      213.        NA             NA  
 4  1994     4          NA     NA              NA            NA        NA         NA             NA  
 5  1994     5         176.    73.2           160.          233.      190         NA             41.8
 6  1994     6          NA     NA              NA            NA        NA         NA             NA  
 7  1994     7          55.6   33.3           292.          188.      291.        NA             51.4
 8  1994     8          28      0              60.8          68.1      57.6       NA             16.1
 9  1994     9          NA     NA              NA            NA        NA         NA             NA  
10  1994    10          20      0               8.8           9.3       3.6       NA              0  
# … with 286 more rows, and 244 more variables

2b) This is an alternative solution to get the sums for each station and month, which is easier to work with for further steps (especially visualization in ggplot2). I also feel that the code is more straightforward!

rainfall_long_df %>%
    group_by(`Município/Posto`, Ano, Mês) %>%
    summarise(rainfall_per_month = sum(value))

The result will be a long version of the sum of rainfall per month and station.

# A tibble: 62,319 x 4
# Groups:   Município/Posto, Ano [5,522]
   `Município/Posto`   Ano   Mês rainfall_per_month
   <chr>             <dbl> <dbl>              <dbl>
 1 Agua Branca        1994     1              174. 
 2 Agua Branca        1994     2               NA  
 3 Agua Branca        1994     3              285. 
 4 Agua Branca        1994     4               NA  
 5 Agua Branca        1994     5              176. 
 6 Agua Branca        1994     6               NA  
 7 Agua Branca        1994     7               55.6
 8 Agua Branca        1994     8               28  
 9 Agua Branca        1994     9               NA  
10 Agua Branca        1994    10               20  
# … with 62,309 more rows
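The NA totals in the output fall on months 2, 4, 6 and 9, which have fewer than 31 days. This is most likely because the cells for their nonexistent days are NA, and sum() returns NA as soon as one input is NA. Below is a minimal sketch of one way around it, assuming it is acceptable to ignore missing values; the toy data and the column names station and Mes are made up.

```r
library(dplyr)

# Hypothetical toy data: February has no day 30, so that cell is NA
toy_long <- tibble(
  station = "A",
  Ano     = 1994,
  Mes     = c(1, 1, 2, 2),
  dia     = c(30, 31, 28, 30),
  value   = c(5, 2.5, 4, NA)
)

monthly <- toy_long %>%
  group_by(station, Ano, Mes) %>%
  summarise(
    with_na    = sum(value),                # NA as soon as one day is NA
    without_na = sum(value, na.rm = TRUE),  # ignores the NA days
    n_missing  = sum(is.na(value)),         # how many days were ignored
    .groups = "drop"
  )

print(monthly)
```

Note that na.rm = TRUE also hides days on which a station genuinely recorded nothing, which is why the sketch keeps a count of the ignored days next to the total.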

Answer 3

First, I would like to thank you for your responses. Second, I apologize that the question was not in the correct structure (it is my first time here); I am also new to the universe of R. I am using this data as part of a hydrology study, and this structure is necessary to use the hydroTSM package and later on SWAT.

I ran the recommended code and both answers solved my problem, but a question came up. I noticed that when the dates were created, the leap years had a small problem; for now I removed those dates manually.

How can leap years be taken into account when building the database?

Thank you.
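One possible way to approach the leap-year question, offered as a sketch and not as the method used in the answers, is to let R generate the calendar with seq() and join the observations onto it, so that 29 February appears exactly in the years that have one. The short date range, the obs values and the station1 column below are invented for illustration.

```r
# Every real day in a short period around the end of February 1996 (a leap year)
calendar <- data.frame(
  Date = seq(as.Date("1996-02-27"), as.Date("1996-03-01"), by = "day")
)

# Hypothetical observations for one station; 28 February is missing
obs <- data.frame(
  Date     = as.Date(c("1996-02-27", "1996-02-29", "1996-03-01")),
  station1 = c(0, 3.4, 1.2)
)

# Left join: one row per calendar day, NA where no observation exists
daily <- merge(calendar, obs, by = "Date", all.x = TRUE)
print(daily)
```

For the full series, the same idea would use the first and last dates of the data (01/01/1994 and 31/07/2018 in the question) as the limits of seq().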

3 answers

solution1
1 2020-01-24 17:05:17

solution2
0 ACCEPTED 2020-01-24 17:50:04

solution3
0 2020-01-24 22:32:45
