I have two datasets with different numbers of rows:
Shallow water dataset
site depth date data delta yearmon
1 ARN Shallow 2003-01-22 0.04 0.00 Jan 2003
2 ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Shallow 2003-05-28 0.00 0.00 May 2003
4 ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
5 ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
6 ARN Shallow 2003-09-23 0.03 0.00 Sep 2003
...
2946 DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947 DC1 Shallow 2020-02-15 0.12 0.03 Feb 2020
2948 DC1 Shallow 2020-03-15 0.13 0.01 Mar 2020
2949 BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950 BCI Shallow 2020-02-15 0.30 0.05 Feb 2020
2951 BCI Shallow 2020-03-15 0.33 0.03 Mar 2020
Deep water dataset
site depth date data delta yearmon
1 ARN Deep 2003-01-22 0.04 0.00 Jan 2003
2 ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Deep 2003-05-28 0.00 0.00 May 2003
4 ARN Deep 2003-06-24 0.00 0.00 Jun 2003
5 ARN Deep 2003-08-26 0.02 0.02 Aug 2003
6 ARN Deep 2003-09-23 0.02 0.00 Sep 2003
...
2578 DC1 Deep 2020-01-15 0.09 0.04 Jan 2020
2579 DC1 Deep 2020-02-15 0.12 0.03 Feb 2020
2580 DC1 Deep 2020-03-15 0.13 0.01 Mar 2020
2581 BCI Deep 2020-01-15 0.25 -0.03 Jan 2020
2582 BCI Deep 2020-02-15 0.31 0.06 Feb 2020
2583 BCI Deep 2020-03-15 0.34 0.03 Mar 2020
There are multiple sites, some with missing entries, so the rows are not going to show the same site and date across datasets throughout their entire lengths.
I'd like to have a separate dataframe with the averages of the data column of the two datasets, for every entry that has the same site and date in both. For example:
site depth date data delta yearmon
1 ARN AVG 2003-01-22 0.04 0.00 Jan 2003
2 ARN AVG 2003-04-28 0.00 -0.04 Apr 2003
3 ARN AVG 2003-05-28 0.00 0.00 May 2003
4 ARN AVG 2003-06-24 0.00 0.00 Jun 2003
5 ARN AVG 2003-08-26 0.025 0.02 Aug 2003
6 ARN AVG 2003-09-23 0.025 0.00 Sep 2003
...
? DC1 AVG 2020-01-15 0.09 0.04 Jan 2020
? DC1 AVG 2020-02-15 0.12 0.03 Feb 2020
? DC1 AVG 2020-03-15 0.13 0.01 Mar 2020
? BCI AVG 2020-01-15 0.25 -0.03 Jan 2020
? BCI AVG 2020-02-15 0.305 0.06 Feb 2020
? BCI AVG 2020-03-15 0.335 0.03 Mar 2020
The problem is, I have no clue how to deal with the different lengths. The best I could think of was this:
if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
shallow$avg <- mean(shallow$data, deep$data)
}
Which evidently does not work.
Please help?
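On why the attempt above fails: `if()` expects a single TRUE/FALSE, whereas `shallow$yearmon == deep$yearmon` compares two vectors of different lengths element by element (recycling the shorter one), so it neither lines up matching rows nor yields one condition. In addition, `mean(x, y)` treats its second argument as `trim`, not as more data. A minimal sketch of that second point, using the ARN values for 2003-08-26 from the question; the variable names are made up for illustration:

```r
# The two data values for ARN on 2003-08-26 in the question
shallow_val <- 0.03
deep_val <- 0.02

# mean(x, y) passes y as the `trim` argument, so this is NOT the mean of both values
wrong <- mean(shallow_val, deep_val)

# Average of two aligned values (works element-wise on aligned vectors too)
right <- (shallow_val + deep_val) / 2

print(c(wrong = wrong, right = right))
```

So the rows need to be matched by site and date first (for example with a join), and only then averaged.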
How about something like this? I made up data just to give a simple idea:
df_shallow <- data.frame(
site = c(rep("ARN",2),rep("DC1",2)),
depth = "Shallow",
data = rnorm(4,0,0.1),
yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)
df_deep <- data.frame(
site = c(rep("ARN",2),c("DC1","DC2")),
depth = "Deep",
data = rnorm(4,0,0.1),
yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)
# install.packages("tidyverse")
library(tidyverse)
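# The two inputs share no rows (depth always differs), so stack them
# rather than joining on every common column
df_avg <- bind_rows(df_shallow, df_deep) %>%
  # one row per site/yearmon, with Shallow and Deep as columns
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  mutate(AVG = (Shallow + Deep) / 2) %>%
  # drop site/yearmon pairs that are missing from either dataset
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)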
Which would give you something like this:
> head(df_avg)
# A tibble: 2 × 4
site yearmon data depth
<chr> <chr> <dbl> <chr>
1 ARN Jan 2003 0.146 AVG
2 DC1 Jan 2003 -0.0998 AVG
You could, of course, move the columns around to match whatever order you need.
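For instance, the reordering could be sketched with `select()`. The `df_avg` below is a hand-typed stand-in built from the two rows printed above, and `df_avg_ordered` is just an illustrative name:

```r
library(dplyr)

# Hypothetical stand-in for df_avg, using the two rows printed above
df_avg <- tibble(
  site = c("ARN", "DC1"),
  yearmon = c("Jan 2003", "Jan 2003"),
  data = c(0.146, -0.0998),
  depth = "AVG"
)

# Put the columns in the order used in the question: site, depth, data, yearmon
df_avg_ordered <- df_avg %>%
  select(site, depth, data, yearmon)

print(df_avg_ordered)
```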
Assuming that each site/date pair is unique within shallow and within deep, you could approach the problem like this (I altered the dummy data a bit):
deep <- data.table::fread(" site depth date data delta mon year
ARN Deep 2003-01-22 0.04 0.00 Jan 2003
ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN Deep 2003-05-28 0.00 0.00 May 2003
ARN Deep 2003-06-24 0.00 0.00 Jun 2003
ARN Deep 2003-08-26 0.02 0.02 Aug 2003
ARN Deep 2003-09-23 0.02 0.00 Sep 2003")
shallow <- data.table::fread(" site depth date data delta mon year
ARN Shallow 2003-01-22 0.06 0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05 0.00 May 2003
ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
ARN Shallow 2003-09-23 0.03 0.00 Sep 2003")
library(dplyr)  # provides %>% and the verbs used below
df <- deep %>%
  # inner join eliminates all non-matching rows from both dfs
  dplyr::inner_join(shallow, by = c("site", "date"))
# not sure if you mean this result:
df %>%
dplyr::group_by(site) %>%
dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
dplyr::ungroup()
# A tibble: 1 x 3
site deep shallow
<chr> <dbl> <dbl>
1 ARN 0.0133 0.03
# or this
df %>%
# maybe you want to make sure there are no NAs left
dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
# select and calculate in the same step
dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)
site date mean_data
1: ARN 2003-01-22 0.050
2: ARN 2003-04-28 0.005
3: ARN 2003-05-28 0.025
4: ARN 2003-06-24 0.000
5: ARN 2003-08-26 0.025
6: ARN 2003-09-23 0.025
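If you would rather avoid extra packages, the same idea can be sketched in base R with `merge()`, which by default keeps only the key combinations present in both inputs. This is a rough sketch under assumptions: the small data frames below are stand-ins (three ARN rows from the dummy data above, plus one shallow-only DC1 row added to show that unmatched rows drop out), and the `delta` and `yearmon` columns are left out:

```r
# Stand-in data: ARN values copied from the dummy data above, plus one
# shallow-only DC1 row to show that unmatched rows are dropped
shallow <- data.frame(
  site = c("ARN", "ARN", "ARN", "DC1"),
  depth = "Shallow",
  date = as.Date(c("2003-01-22", "2003-04-28", "2003-08-26", "2020-01-15")),
  data = c(0.06, 0.01, 0.03, 0.09)
)
deep <- data.frame(
  site = c("ARN", "ARN", "ARN"),
  depth = "Deep",
  date = as.Date(c("2003-01-22", "2003-04-28", "2003-08-26")),
  data = c(0.04, 0.00, 0.02)
)

# merge() keeps only site/date pairs found in both inputs (all = FALSE by default)
both <- merge(shallow, deep, by = c("site", "date"),
              suffixes = c(".shallow", ".deep"))

# Build the averaged data frame in the column order used in the question
avg <- data.frame(
  site = both$site,
  depth = "AVG",
  date = both$date,
  data = (both$data.shallow + both$data.deep) / 2
)
print(avg)
```

The `suffixes` argument only renames the clashing `data` and `depth` columns, so it is easy to see which value came from which dataset.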