
R - compare multiple columns of two dataframes

I have two datasets with different numbers of rows:

Shallow water dataset

  site   depth       date data delta  yearmon
1  ARN Shallow 2003-01-22 0.04  0.00 Jan 2003
2  ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3  ARN Shallow 2003-05-28 0.00  0.00 May 2003
4  ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
5  ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
6  ARN Shallow 2003-09-23 0.03  0.00 Sep 2003
...
2946  DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947  DC1 Shallow 2020-02-15 0.12  0.03 Feb 2020
2948  DC1 Shallow 2020-03-15 0.13  0.01 Mar 2020
2949  BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950  BCI Shallow 2020-02-15 0.30  0.05 Feb 2020
2951  BCI Shallow 2020-03-15 0.33  0.03 Mar 2020


Deep water dataset

  site depth       date data delta  yearmon
1  ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
2  ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  Deep 2003-05-28 0.00  0.00 May 2003
4  ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
5  ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
6  ARN  Deep 2003-09-23 0.02  0.00 Sep 2003
...
2578  DC1  Deep 2020-01-15 0.09  0.04 Jan 2020
2579  DC1  Deep 2020-02-15 0.12  0.03 Feb 2020
2580  DC1  Deep 2020-03-15 0.13  0.01 Mar 2020
2581  BCI  Deep 2020-01-15 0.25 -0.03 Jan 2020
2582  BCI  Deep 2020-02-15 0.31  0.06 Feb 2020
2583  BCI  Deep 2020-03-15 0.34  0.03 Mar 2020

There are multiple sites, some with missing entries, so rows at the same position in the two datasets do not always refer to the same site and date.

I'd like a separate data frame containing the average of the data column from the two datasets, for every entry that has the same site and date in both. For example:

  site depth       date data delta  yearmon
1  ARN  AVG 2003-01-22 0.04  0.00 Jan 2003
2  ARN  AVG 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  AVG 2003-05-28 0.00  0.00 May 2003
4  ARN  AVG 2003-06-24 0.00  0.00 Jun 2003
5  ARN  AVG 2003-08-26 0.025  0.02 Aug 2003
6  ARN  AVG 2003-09-23 0.025  0.00 Sep 2003
...
?  DC1  AVG 2020-01-15 0.09  0.04 Jan 2020
?  DC1  AVG 2020-02-15 0.12  0.03 Feb 2020
?  DC1  AVG 2020-03-15 0.13  0.01 Mar 2020
?  BCI  AVG 2020-01-15 0.25 -0.03 Jan 2020
?  BCI  AVG 2020-02-15 0.305  0.06 Feb 2020
?  BCI  AVG 2020-03-15 0.335  0.03 Mar 2020

The problem is that I have no idea how to deal with the different lengths. The best I could come up with was this:

if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
  shallow$avg <- mean(shallow$data, deep$data)
}

Which evidently does not work.

Please help?

How about something like this? I made up data just to give a simple idea:

df_shallow <- data.frame(
  site = c(rep("ARN",2),rep("DC1",2)),
  depth = "Shallow",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)

df_deep <- data.frame(
  site = c(rep("ARN",2),c("DC1","DC2")),
  depth = "Deep",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)

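# install.packages("tidyverse")
library(tidyverse)

# Stack the two data frames. A full_join() without `by` would match on every
# shared column and, because depth always differs, end up doing the same thing.
df_avg <- bind_rows(df_shallow, df_deep) %>%
  # one row per site/yearmon, with a Shallow and a Deep column
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  # AVG is NA whenever one of the two depths is missing
  mutate(AVG = (Shallow + Deep) / 2) %>%
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)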

Which would give you something like this:

> head(df_avg)
# A tibble: 2 × 4
  site  yearmon     data depth
  <chr> <chr>      <dbl> <chr>
1 ARN   Jan 2003  0.146  AVG  
2 DC1   Jan 2003 -0.0998 AVG 

You could, of course, move the columns around to match whatever order you need.
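For comparison, the same matching can be sketched in base R with merge(), which by default keeps only the keys present in both data frames. This is a minimal sketch on made-up data: the fixed values and the names shallow_demo, deep_demo and matched are illustrative assumptions, not part of the answer above.

```r
# Hypothetical toy data with fixed values so the result is reproducible
shallow_demo <- data.frame(
  site    = c("ARN", "ARN", "DC1", "DC1"),
  yearmon = c("Jan 2003", "Apr 2003", "Jan 2003", "Feb 2003"),
  data    = c(0.04, 0.00, 0.10, 0.20)
)
deep_demo <- data.frame(
  site    = c("ARN", "ARN", "DC1", "DC2"),
  yearmon = c("Jan 2003", "Mar 2003", "Jan 2003", "Feb 2003"),
  data    = c(0.06, 0.02, 0.30, 0.40)
)

# merge() defaults to an inner join: only site/yearmon pairs found in both
# inputs survive, so the differing row counts are not a problem
matched <- merge(shallow_demo, deep_demo,
                 by = c("site", "yearmon"),
                 suffixes = c("_shallow", "_deep"))

# Row-wise average of the two data columns
matched$data  <- (matched$data_shallow + matched$data_deep) / 2
matched$depth <- "AVG"
matched <- matched[, c("site", "depth", "yearmon", "data")]
print(matched)
```

Only ARN and DC1 in Jan 2003 appear in both toy data frames, so the result has two rows. Entries that exist in just one of them are dropped without any explicit NA filtering.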

Assuming that each site and date combination is unique within shallow and within deep, you could approach the problem like this (I altered the dummy data a bit):

deep <- data.table::fread("  site depth       date data delta  mon year
ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN  Deep 2003-05-28 0.00  0.00 May 2003
ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
ARN  Deep 2003-09-23 0.02  0.00 Sep 2003")

shallow <- data.table::fread("  site depth       date data delta  mon year
ARN Shallow 2003-01-22 0.06  0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05  0.00 May 2003
ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
ARN Shallow 2003-09-23 0.03  0.00 Sep 2003")

library(dplyr)  # attaches %>%; the verbs below are still called with dplyr::

df <- deep %>% 
    # inner join drops all non-matching rows from both data frames
    dplyr::inner_join(shallow, by = c("site","date"))

# not sure if you mean this result:
df %>%
    dplyr::group_by(site) %>%
    dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
    dplyr::ungroup()

# A tibble: 1 x 3
  site    deep shallow
  <chr>  <dbl>   <dbl>
1 ARN   0.0133    0.03

# or this
df %>% 
    # maybe you want to make sure there are no NAs left
    dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
    # select and calculate in the same step
    dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)

   site       date mean_data
1:  ARN 2003-01-22     0.050
2:  ARN 2003-04-28     0.005
3:  ARN 2003-05-28     0.025
4:  ARN 2003-06-24     0.000
5:  ARN 2003-08-26     0.025
6:  ARN 2003-09-23     0.025
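To get closer to the layout requested in the question (site, depth set to "AVG", date, data, yearmon), the join can be combined with a check of the uniqueness assumption. The following is only a sketch: the toy data frames shallow_toy and deep_toy and the suffix names are made up for illustration, and the delta column is left out because the question does not say how it should be combined.

```r
library(dplyr)

# Hypothetical toy data; the real shallow/deep data frames would go here
shallow_toy <- data.frame(
  site = c("ARN", "ARN", "BCI"),
  date = as.Date(c("2003-08-26", "2003-09-23", "2020-02-15")),
  data = c(0.03, 0.03, 0.30)
)
deep_toy <- data.frame(
  site = c("ARN", "BCI", "BCI"),
  date = as.Date(c("2003-08-26", "2020-02-15", "2020-03-15")),
  data = c(0.02, 0.31, 0.34)
)

# Check the uniqueness assumption: no site/date pair may repeat within a frame,
# otherwise the join would multiply rows
stopifnot(
  anyDuplicated(shallow_toy[c("site", "date")]) == 0,
  anyDuplicated(deep_toy[c("site", "date")]) == 0
)

avg_toy <- inner_join(shallow_toy, deep_toy,
                      by = c("site", "date"),
                      suffix = c("_shallow", "_deep")) %>%
  transmute(
    site,
    depth   = "AVG",
    date,
    data    = (data_shallow + data_deep) / 2,
    yearmon = format(date, "%b %Y")  # month abbreviation depends on the locale
  )
print(avg_toy)
```

With this toy input only two site/date pairs occur in both data frames, giving averages of 0.025 for ARN on 2003-08-26 and 0.305 for BCI on 2020-02-15.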
