I have a precipitation database, where it is structured as follows.
Season; YEAR; MONTH; DAY 01; DAY 02; DAY 03 ..... DAY 31
At first I wanted to calculate the accumulated in each month (I did it using the precintcon), but only for one season. Now I want to do the same thing, but separating each station, where I will have the daily and monthly values for each station, in addition to changing the structure of the database. Where the first column would be the date and the other columns would be each season.
Date; season1; station2; estacao3 ....... estacaoN
01/01/1994;30;10;5;6
01/02/1994;10;12;55
.
.
.
.
.
.
.
31/07/2018
First, as your dataframe is pretty heavy (I only run the code on a portion of it), you can open it with fread
function from data.table
(I convert your xlsx file in a csv file).
library(data.table)
df <- fread("../Dados_precipitacao.csv", skip = 2, header = TRUE)
Then, you can reshape your dataframe in a long
format by using melt
function from data.table
:
library(data.table)
colonne <- grep("dia",colnames(df),value = TRUE)
dt.m <- melt(df, measure = list(colonne),value.name = "DIA")
Now, you have six columns:
Município/Posto Bacia Ano Mês variable DIA
1: Agua Branca Piancó 1994 1 dia 1 0
2: Agua Branca Piancó 1994 2 dia 1 0
3: Agua Branca Piancó 1994 3 dia 1 20
4: Agua Branca Piancó 1994 4 dia 1 0
5: Agua Branca Piancó 1994 5 dia 1 0
6: Agua Branca Piancó 1994 6 dia 1 0
Now, using data.table
, we can create a date column by pasting Ano, Mes and Dia (Dia will be modify to remove "dia " from the string), then, we will use the ymd
function from the lubridate
package to converting this character string in a data format:
library(data.table)
test <- dt.m[1:1000,]
test[, Day:=gsub("dia ","",variable)]
test[, Date := do.call(paste, c(.SD, sep = "-")), .SDcols = c("Ano","Mês","Day")]
test[, Date:= ymd(Date)]
Município/Posto Bacia Ano Mês variable DIA Day Date
1: Agua Branca Piancó 1994 1 dia 1 0 1 1994-01-01
2: Agua Branca Piancó 1994 2 dia 1 0 1 1994-02-01
3: Agua Branca Piancó 1994 3 dia 1 20 1 1994-03-01
4: Agua Branca Piancó 1994 4 dia 1 0 1 1994-04-01
5: Agua Branca Piancó 1994 5 dia 1 0 1 1994-05-01
---
996: Alagoa Nova Mamanguape 2003 8 dia 1 0 1 2003-08-01
997: Alagoa Nova Mamanguape 2003 9 dia 1 0 1 2003-09-01
998: Alagoa Nova Mamanguape 2003 10 dia 1 0 1 2003-10-01
999: Alagoa Nova Mamanguape 2003 11 dia 1 0 1 2003-11-01
1000: Alagoa Nova Mamanguape 2003 12 dia 1 0 1 2003-12-01
Now, we can use the function dcast
from data.table
to pivot the datatable in a wider format and create one column for each station (here I used Municipio/Posto):
library(data.table)
t <- dcast(test, value.var = "DIA", ... ~ `Município/Posto`)
Bacia Ano Mês variable Day Date Agua Branca Aguiar Alagoa Grande Alagoa Nova
1: Mamanguape 1994 1 dia 1 1 1994-01-01 NA NA 0 0
2: Mamanguape 1994 2 dia 1 1 1994-02-01 NA NA 0 0
3: Mamanguape 1994 3 dia 1 1 1994-03-01 NA NA 0 0
4: Mamanguape 1994 4 dia 1 1 1994-04-01 NA NA 0 0
5: Mamanguape 1994 5 dia 1 1 1994-05-01 NA NA 0 0
---
584: Piancó 2018 3 dia 1 1 2018-03-01 5.4 0 NA NA
585: Piancó 2018 4 dia 1 1 2018-04-01 12.6 0 NA NA
586: Piancó 2018 5 dia 1 1 2018-05-01 15.8 NA NA NA
587: Piancó 2018 6 dia 1 1 2018-06-01 0.0 NA NA NA
588: Piancó 2018 7 dia 1 1 2018-07-01 0.0 NA NA NA
Hope that it is what you are looking for.
BTW: It will make things easier for everyone, if you post a reproducible example of your data instead of inserting a link to your full dataset (that is pretty heavy). To know how to do a good reproducible example: How to make a great R reproducible example
This task requires some reshaping of the dataset, first make it longer and then wider again. dc37's answer already describes how to do that with data.table
. I'd recommend a little different approach, using only tidyverse
functions.
You state, that you want to calculate the sum of the rainfall per month at each station, for that task it is actually easier to keep the data in a long format instead of making it wide again. I'll demonstrate both options (2a and 2b) below.
I would also recommend not merging the date variables, because that makes it harder to group the data by month, alternatively to my approach, you could merge year and month only, that would still allow for the necessary grouping. Anyways, 2a) demonstrates how to use tidyr::unite() to merge the date variables.
1) Convert dataset to long format
library(tidyverse)
library(readxl)
rainfall_df <- read_excel("Dados_precipitacao.xls", skip = 2)
rainfall_long_df <-
rainfall_df %>%
select(-Bacia) %>%
pivot_longer(`dia 1`:`dia 31`, names_to = "dia") %>%
mutate(dia = gsub("dia ", "", dia))
rainfall_long_df looks like this:
# A tibble: 1,931,889 x 5
`Município/Posto` Ano Mês dia value
<chr> <dbl> <dbl> <chr> <dbl>
1 Agua Branca 1994 1 1 0
2 Agua Branca 1994 1 2 0
3 Agua Branca 1994 1 3 0
4 Agua Branca 1994 1 4 0
5 Agua Branca 1994 1 5 0
6 Agua Branca 1994 1 6 8.6
7 Agua Branca 1994 1 7 0
8 Agua Branca 1994 1 8 2
9 Agua Branca 1994 1 9 0
10 Agua Branca 1994 1 10 0
# … with 1,931,879 more rows
2a) This is what you asked for: Calculating the sums per month and station from a wide dataset.
rainfall_wide_df <-
rainfall_long_df %>%
unite(data, dia, Mês, Ano, sep = "/", remove = FALSE) %>%
pivot_wider(names_from = `Município/Posto`)
rainfall_wide_df %>%
group_by(Ano, Mês) %>%
summarise_at(vars(`Agua Branca`:`Zabelê`), sum)
This results in:
# A tibble: 296 x 253
# Groups: Ano [26]
Ano Mês `Agua Branca` Aguiar `Alagoa Grande` `Alagoa Nova` Alagoinha Alcantil `Algodão de Jan…
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1994 1 174. 442. 101 68.5 64.6 NA NA
2 1994 2 NA NA NA NA NA NA NA
3 1994 3 285. 120. 239. 210. 213. NA NA
4 1994 4 NA NA NA NA NA NA NA
5 1994 5 176. 73.2 160. 233. 190 NA 41.8
6 1994 6 NA NA NA NA NA NA NA
7 1994 7 55.6 33.3 292. 188. 291. NA 51.4
8 1994 8 28 0 60.8 68.1 57.6 NA 16.1
9 1994 9 NA NA NA NA NA NA NA
10 1994 10 20 0 8.8 9.3 3.6 NA 0
# … with 286 more rows, and 244 more variables
2b) This is an alternative solution to get the sums for each station and month. Which is easier to work with for further steps (visualization in ggplot2 especially). Also I feel, that the code is more straight forward!
rainfall_long_df %>%
group_by(`Município/Posto`, Ano, Mês) %>%
summarise(rainfall_per_month = sum(value))
The result will be a long version of the sum of rainfall per month and station.
# A tibble: 62,319 x 4
# Groups: Município/Posto, Ano [5,522]
`Município/Posto` Ano Mês rainfall_per_month
<chr> <dbl> <dbl> <dbl>
1 Agua Branca 1994 1 174.
2 Agua Branca 1994 2 NA
3 Agua Branca 1994 3 285.
4 Agua Branca 1994 4 NA
5 Agua Branca 1994 5 176.
6 Agua Branca 1994 6 NA
7 Agua Branca 1994 7 55.6
8 Agua Branca 1994 8 28
9 Agua Branca 1994 9 NA
10 Agua Branca 1994 10 20
# … with 62,309 more rows
First, I would like to thank you for your responses. Second, I apologize for the question that is not in the correct structure (my first time here), I am also new to the universe of R. I am using this data as part of a hydrology study and this structure is necessary to use the HydroTSM package and later on SWAT.
I did the recommended tests, but some questions came up. and both attended to the resolution of my problem. But, I realized that when the dates were created, the leap years had a small problem, however I removed these dates manually.
How could you do to consider leap years in building the database?
Thank you.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.