
R - compare multiple columns of two dataframes

I have two datasets with different numbers of rows:

Shallow water dataset

  site   depth       date data delta  yearmon
1  ARN Shallow 2003-01-22 0.04  0.00 Jan 2003
2  ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3  ARN Shallow 2003-05-28 0.00  0.00 May 2003
4  ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
5  ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
6  ARN Shallow 2003-09-23 0.03  0.00 Sep 2003
...
2946  DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947  DC1 Shallow 2020-02-15 0.12  0.03 Feb 2020
2948  DC1 Shallow 2020-03-15 0.13  0.01 Mar 2020
2949  BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950  BCI Shallow 2020-02-15 0.30  0.05 Feb 2020
2951  BCI Shallow 2020-03-15 0.33  0.03 Mar 2020


Deep water dataset

  site depth       date data delta  yearmon
1  ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
2  ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  Deep 2003-05-28 0.00  0.00 May 2003
4  ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
5  ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
6  ARN  Deep 2003-09-23 0.02  0.00 Sep 2003
...
2578  DC1  Deep 2020-01-15 0.09  0.04 Jan 2020
2579  DC1  Deep 2020-02-15 0.12  0.03 Feb 2020
2580  DC1  Deep 2020-03-15 0.13  0.01 Mar 2020
2581  BCI  Deep 2020-01-15 0.25 -0.03 Jan 2020
2582  BCI  Deep 2020-02-15 0.31  0.06 Feb 2020
2583  BCI  Deep 2020-03-15 0.34  0.03 Mar 2020

There are multiple sites, some with missing entries, so the same row number does not correspond to the same site and date in both datasets.

I'd like a separate dataframe containing the average of the data column from the two datasets, for every entry that has the same site and date in both. For example:

  site depth       date data delta  yearmon
1  ARN  AVG 2003-01-22 0.04  0.00 Jan 2003
2  ARN  AVG 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  AVG 2003-05-28 0.00  0.00 May 2003
4  ARN  AVG 2003-06-24 0.00  0.00 Jun 2003
5  ARN  AVG 2003-08-26 0.025  0.02 Aug 2003
6  ARN  AVG 2003-09-23 0.025  0.00 Sep 2003
...
?  DC1  AVG 2020-01-15 0.09  0.04 Jan 2020
?  DC1  AVG 2020-02-15 0.12  0.03 Feb 2020
?  DC1  AVG 2020-03-15 0.13  0.01 Mar 2020
?  BCI  AVG 2020-01-15 0.25 -0.03 Jan 2020
?  BCI  AVG 2020-02-15 0.305  0.06 Feb 2020
?  BCI  AVG 2020-03-15 0.335  0.03 Mar 2020

The problem is, I have no clue how to deal with the different lengths. The best I could come up with was this:

if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
  shallow$avg <- mean(shallow$data, deep$data)
}

Which evidently does not work.

Please help?

How about something like this? I made up data just to give a simple idea:

df_shallow <- data.frame(
  site = c(rep("ARN",2),rep("DC1",2)),
  depth = "Shallow",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)

df_deep <- data.frame(
  site = c(rep("ARN",2),c("DC1","DC2")),
  depth = "Deep",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)

# install.packages("tidyverse")
library(tidyverse)

# with no `by`, full_join() joins on every shared column, which here simply
# stacks the shallow and deep rows into one long data frame
df_avg <- full_join(df_shallow, df_deep) %>% 
  # spread depth into separate Shallow/Deep columns, one row per site/yearmon
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  mutate(AVG = (Shallow + Deep) / 2) %>%  # NA wherever a site/yearmon is missing from either frame
  filter(!is.na(AVG)) %>%                 # keep only pairs present in both
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)

Which would give you something like this:

> head(df_avg)
# A tibble: 2 × 4
  site  yearmon     data depth
  <chr> <chr>      <dbl> <chr>
1 ARN   Jan 2003  0.146  AVG  
2 DC1   Jan 2003 -0.0998 AVG 

You could, of course, move the columns around to match whatever order you need.
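For instance, a minimal sketch using dplyr::relocate() (the target order here is just an illustration, not something from your data):

df_avg <- df_avg %>%
  relocate(site, depth, data, yearmon)  # reorder the columns to taste
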

Considering that the site/date combinations are unique within shallow and deep, you could approach the problem like this (I altered the dummy data a bit):

deep <- data.table::fread("  site depth       date data delta  mon year
ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN  Deep 2003-05-28 0.00  0.00 May 2003
ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
ARN  Deep 2003-09-23 0.02  0.00 Sep 2003")

shallow <- data.table::fread("  site depth       date data delta  mon year
ARN Shallow 2003-01-22 0.06  0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05  0.00 May 2003
ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
ARN Shallow 2003-09-23 0.03  0.00 Sep 2003")

library(magrittr)  # for the %>% pipe (also attached by library(dplyr) or library(tidyverse))

df <- deep %>% 
    # an inner join drops all non-matching rows from both data frames
    dplyr::inner_join(shallow, by = c("site", "date"))

# not sure if you mean this result (one mean per site):
df %>%
    dplyr::group_by(site) %>%
    dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
    dplyr::ungroup()

# A tibble: 1 x 3
  site    deep shallow
  <chr>  <dbl>   <dbl>
1 ARN   0.0133    0.03

# or this (one mean per matching site and date)
df %>% 
    # maybe you want to make sure there are no NAs left
    dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
    # select and calculate in the same step
    dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)

   site       date mean_data
1:  ARN 2003-01-22     0.050
2:  ARN 2003-04-28     0.005
3:  ARN 2003-05-28     0.025
4:  ARN 2003-06-24     0.000
5:  ARN 2003-08-26     0.025
6:  ARN 2003-09-23     0.025
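
If you also want the result in roughly the layout from your question, with a depth column set to "AVG", a sketch along these lines should work with the dummy data above; yearmon is rebuilt here from the mon/year columns of that dummy data, so adapt the names to your real columns:

df %>% 
    dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
    dplyr::transmute(site,
                     depth = "AVG",
                     date,
                     data = (data.x + data.y) / 2,
                     yearmon = paste(mon.x, year.x))  # mon.x/year.x come from the deep table after the join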
