R - Compare multiple columns of two data frames

Question

I have two datasets with different numbers of rows:

Shallow water dataset

  site   depth       date data delta  yearmon
1  ARN Shallow 2003-01-22 0.04  0.00 Jan 2003
2  ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3  ARN Shallow 2003-05-28 0.00  0.00 May 2003
4  ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
5  ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
6  ARN Shallow 2003-09-23 0.03  0.00 Sep 2003
...
2946  DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947  DC1 Shallow 2020-02-15 0.12  0.03 Feb 2020
2948  DC1 Shallow 2020-03-15 0.13  0.01 Mar 2020
2949  BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950  BCI Shallow 2020-02-15 0.30  0.05 Feb 2020
2951  BCI Shallow 2020-03-15 0.33  0.03 Mar 2020


Deep water dataset

  site depth       date data delta  yearmon
1  ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
2  ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  Deep 2003-05-28 0.00  0.00 May 2003
4  ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
5  ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
6  ARN  Deep 2003-09-23 0.02  0.00 Sep 2003
...
2578  DC1  Deep 2020-01-15 0.09  0.04 Jan 2020
2579  DC1  Deep 2020-02-15 0.12  0.03 Feb 2020
2580  DC1  Deep 2020-03-15 0.13  0.01 Mar 2020
2581  BCI  Deep 2020-01-15 0.25 -0.03 Jan 2020
2582  BCI  Deep 2020-02-15 0.31  0.06 Feb 2020
2583  BCI  Deep 2020-03-15 0.34  0.03 Mar 2020

There are multiple sites, some with missing entries, so rows at the same position in the two datasets will not always refer to the same site and date.

I'd like a separate data frame containing the average of the data column from the two datasets, for every entry that has the same site and date in both. For example:

  site depth       date data delta  yearmon
1  ARN  AVG 2003-01-22 0.04  0.00 Jan 2003
2  ARN  AVG 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  AVG 2003-05-28 0.00  0.00 May 2003
4  ARN  AVG 2003-06-24 0.00  0.00 Jun 2003
5  ARN  AVG 2003-08-26 0.025  0.02 Aug 2003
6  ARN  AVG 2003-09-23 0.025  0.00 Sep 2003
...
?  DC1  AVG 2020-01-15 0.09  0.04 Jan 2020
?  DC1  AVG 2020-02-15 0.12  0.03 Feb 2020
?  DC1  AVG 2020-03-15 0.13  0.01 Mar 2020
?  BCI  AVG 2020-01-15 0.25 -0.03 Jan 2020
?  BCI  AVG 2020-02-15 0.305  0.06 Feb 2020
?  BCI  AVG 2020-03-15 0.335  0.03 Mar 2020
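
One way to get at this in base R, sketched under the assumption that each site/date pair appears at most once per dataset: merge() keeps only the rows whose key columns match in both data frames by default, so the differing row counts stop mattering. The toy rows below reuse a few values from the tables above; the "XYZ" site is hypothetical and only there to show an unmatched row being dropped.

```r
# Toy stand-ins for the two datasets; only a handful of rows.
shallow <- data.frame(
  site  = c("ARN", "ARN", "BCI"),
  depth = "Shallow",
  date  = as.Date(c("2003-08-26", "2003-09-23", "2020-02-15")),
  data  = c(0.03, 0.03, 0.30)
)
deep <- data.frame(
  site  = c("ARN", "BCI", "XYZ"),   # "XYZ" is a hypothetical unmatched site
  depth = "Deep",
  date  = as.Date(c("2003-08-26", "2020-02-15", "2020-02-15")),
  data  = c(0.02, 0.31, 0.50)
)

# Inner join on site and date: rows without a partner are dropped.
both <- merge(shallow, deep, by = c("site", "date"),
              suffixes = c(".shallow", ".deep"))

# Row-wise average of the two data columns.
avg <- data.frame(
  site  = both$site,
  depth = "AVG",
  date  = both$date,
  data  = (both$data.shallow + both$data.deep) / 2
)
print(avg)
```

Joining on date rather than yearmon is a choice made here for illustration; by = c("site", "yearmon") would work the same way if the month is the intended key.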

The problem is, I have no idea how to deal with the different lengths. The best I could think of was this:

if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
  shallow$avg <- mean(shallow$data, deep$data)
}

Which evidently does not work.

Please help?
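
For reference, two separate things go wrong in that attempt, shown here with made-up vectors: if expects a single TRUE/FALSE rather than a vector of comparisons, and mean() treats its second argument as trim rather than as more data.

```r
a <- c(0.03, 0.30)  # illustrative values only
b <- c(0.02, 0.31)

# `==` compares element by element and returns one logical per row,
# which `if` cannot use as a single condition.
a == b

# mean(x, y) passes y as the `trim` argument, so y is not averaged in.
wrong <- mean(0.03, 0.02)
right <- mean(c(0.03, 0.02))

# Element-wise averages of two equal-length vectors need plain arithmetic.
pairwise <- (a + b) / 2
```

The element-wise form only makes sense once the two vectors are aligned row for row, which is what a join on site and date provides.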

Answer 1

How about something like this? I made up some data just to give a simple idea:

df_shallow <- data.frame(
  site = c(rep("ARN",2),rep("DC1",2)),
  depth = "Shallow",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)

df_deep <- data.frame(
  site = c(rep("ARN",2),c("DC1","DC2")),
  depth = "Deep",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)

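# install.packages("tidyverse")
library(tidyverse)

# stack the two tables so that depth can be spread into columns
df_avg <- bind_rows(df_shallow, df_deep) %>%
  # one row per site/yearmon, with Shallow and Deep as separate columns
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  mutate(AVG = (Shallow + Deep) / 2) %>%
  # keep only the site/yearmon pairs present in both tables
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)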

Which would give you something like this:

> head(df_avg)
# A tibble: 2 × 4
  site  yearmon     data depth
  <chr> <chr>      <dbl> <chr>
1 ARN   Jan 2003  0.146  AVG  
2 DC1   Jan 2003 -0.0998 AVG

You could, of course, move the columns around to match whatever order you need.
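
As a small illustration of that last point, base R column indexing is one way to put the columns in a chosen order; dplyr's select() would do the same. The data frame below is a stand-in built from the sample output printed above.

```r
# Stand-in for df_avg, using the values from the sample output.
df_avg <- data.frame(
  site    = c("ARN", "DC1"),
  yearmon = c("Jan 2003", "Jan 2003"),
  data    = c(0.146, -0.0998),
  depth   = "AVG"
)

# Reorder columns by name.
df_avg <- df_avg[, c("site", "depth", "data", "yearmon")]
names(df_avg)
```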

Answer 2

Assuming that each site and date combination is unique within shallow and within deep, you could approach the problem like this (I altered the dummy data a bit):

deep <- data.table::fread("  site depth       date data delta  mon year
ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN  Deep 2003-05-28 0.00  0.00 May 2003
ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
ARN  Deep 2003-09-23 0.02  0.00 Sep 2003")

shallow <- data.table::fread("  site depth       date data delta  mon year
ARN Shallow 2003-01-22 0.06  0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05  0.00 May 2003
ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
ARN Shallow 2003-09-23 0.03  0.00 Sep 2003")

library(dplyr)  # provides %>% and the join verbs used below

df <- deep %>% 
    # inner join drops all non-matching rows from both data frames
    dplyr::inner_join(shallow, by = c("site","date"))

# not sure if you mean this result:
df %>%
    dplyr::group_by(site) %>%
    dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
    dplyr::ungroup()

# A tibble: 1 x 3
  site    deep shallow
  <chr>  <dbl>   <dbl>
1 ARN   0.0133    0.03

# or this
df %>% 
    # maybe you want to make sure there are no NAs left
    dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
    # select and calculate in the same step
    dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)

   site       date mean_data
1:  ARN 2003-01-22     0.050
2:  ARN 2003-04-28     0.005
3:  ARN 2003-05-28     0.025
4:  ARN 2003-06-24     0.000
5:  ARN 2003-08-26     0.025
6:  ARN 2003-09-23     0.025
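
The approach above relies on each site/date pair being unique within each table. A quick way to check that before joining, sketched in base R with made-up rows (shallow_check is a hypothetical name): anyDuplicated() returns 0 when no key is repeated.

```r
# Illustrative table with one repeated site/date pair.
shallow_check <- data.frame(
  site = c("ARN", "ARN", "ARN"),
  date = as.Date(c("2003-01-22", "2003-04-28", "2003-04-28")),
  data = c(0.06, 0.01, 0.02)
)

keys <- shallow_check[, c("site", "date")]

# Index of the first duplicated key, or 0 if every key is unique.
first_dup <- anyDuplicated(keys)

# One option if duplicates exist: average them per site/date first.
collapsed <- aggregate(data ~ site + date, data = shallow_check, FUN = mean)
```

Whether averaging repeated measurements is appropriate depends on the data; it is shown here only as one possible way to restore unique keys before the join.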

Answer 1: 3, 2021-07-27 19:08:37

Answer 2: 2, accepted, 2021-07-27 19:24:56
