
R - compare multiple columns of two dataframes

I have two datasets with different numbers of rows:

Shallow water dataset

  site   depth       date data delta  yearmon
1  ARN Shallow 2003-01-22 0.04  0.00 Jan 2003
2  ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3  ARN Shallow 2003-05-28 0.00  0.00 May 2003
4  ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
5  ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
6  ARN Shallow 2003-09-23 0.03  0.00 Sep 2003
...
2946  DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947  DC1 Shallow 2020-02-15 0.12  0.03 Feb 2020
2948  DC1 Shallow 2020-03-15 0.13  0.01 Mar 2020
2949  BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950  BCI Shallow 2020-02-15 0.30  0.05 Feb 2020
2951  BCI Shallow 2020-03-15 0.33  0.03 Mar 2020


Deep water dataset

  site depth       date data delta  yearmon
1  ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
2  ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  Deep 2003-05-28 0.00  0.00 May 2003
4  ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
5  ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
6  ARN  Deep 2003-09-23 0.02  0.00 Sep 2003
...
2578  DC1  Deep 2020-01-15 0.09  0.04 Jan 2020
2579  DC1  Deep 2020-02-15 0.12  0.03 Feb 2020
2580  DC1  Deep 2020-03-15 0.13  0.01 Mar 2020
2581  BCI  Deep 2020-01-15 0.25 -0.03 Jan 2020
2582  BCI  Deep 2020-02-15 0.31  0.06 Feb 2020
2583  BCI  Deep 2020-03-15 0.34  0.03 Mar 2020

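There are multiple sites, and some of them are missing entries, so the rows of the two datasets do not line up on the same site and date along their full length.

I would like a separate data frame containing the mean of the data column from the two datasets, for every pair of entries that share the same site and date. For example: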

  site depth       date data delta  yearmon
1  ARN  AVG 2003-01-22 0.04  0.00 Jan 2003
2  ARN  AVG 2003-04-28 0.00 -0.04 Apr 2003
3  ARN  AVG 2003-05-28 0.00  0.00 May 2003
4  ARN  AVG 2003-06-24 0.00  0.00 Jun 2003
5  ARN  AVG 2003-08-26 0.025  0.02 Aug 2003
6  ARN  AVG 2003-09-23 0.025  0.00 Sep 2003
...
?  DC1  AVG 2020-01-15 0.09  0.04 Jan 2020
?  DC1  AVG 2020-02-15 0.12  0.03 Feb 2020
?  DC1  AVG 2020-03-15 0.13  0.01 Mar 2020
?  BCI  AVG 2020-01-15 0.25 -0.03 Jan 2020
?  BCI  AVG 2020-02-15 0.305  0.06 Feb 2020
?  BCI  AVG 2020-03-15 0.335  0.03 Mar 2020

The problem is that I don't know how to deal with the different lengths. The best I could come up with is:

if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
  shallow$avg <- mean(shallow$data, deep$data)
}

This obviously doesn't work.

Please help?
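
For context, here is a minimal sketch of why the attempt above cannot work, using made-up toy vectors rather than the real data. In R, `if()` expects a single TRUE/FALSE (recent R versions raise an error for longer conditions; older ones only warned and used the first element), and the second positional argument of `mean()` is `trim`, not a second data vector.

```r
# Toy vectors (made up for illustration, not the real data)
shallow_data <- c(0.04, 0.00, 0.03)
deep_data    <- c(0.04, 0.00, 0.02)

# mean(x, y) does not average x and y element-wise: y is matched to `trim`.
# Element-wise averaging needs vector arithmetic (or rowMeans on a matrix).
elementwise_avg <- (shallow_data + deep_data) / 2
same_avg        <- rowMeans(cbind(shallow_data, deep_data))

# A vector comparison yields one logical per element, so it cannot be used
# directly as the single condition that if() requires.
cond <- shallow_data == deep_data
length(cond)
```

Both forms of averaging also assume the two vectors are already aligned row by row, which is exactly what the differing row counts prevent; the rows have to be matched on site and date first.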

How about something like this? I made up the data just to give a simple idea:

df_shallow <- data.frame(
  site = c(rep("ARN",2),rep("DC1",2)),
  depth = "Shallow",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)

df_deep <- data.frame(
  site = c(rep("ARN",2),c("DC1","DC2")),
  depth = "Deep",
  data = rnorm(4,0,0.1),
  yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)

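# install.packages("tidyverse")
library(tidyverse)
# Stack the two data frames. A full_join() on all shared columns gives the
# same rows here, because depth never matches between the two.
df_avg <- bind_rows(df_shallow, df_deep) %>%
  # one row per site/yearmon, with Shallow and Deep as separate columns
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  mutate(AVG = (Shallow + Deep) / 2) %>%
  # keep only site/yearmon pairs that exist in both datasets
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)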

This will give you something like this:

> head(df_avg)
# A tibble: 2 × 4
  site  yearmon     data depth
  <chr> <chr>      <dbl> <chr>
1 ARN   Jan 2003  0.146  AVG  
2 DC1   Jan 2003 -0.0998 AVG 

Of course, you can rearrange the columns to match whatever order you need.
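
If you would rather not depend on the tidyverse, the same idea can be sketched in base R with `merge()`, which performs an inner join by default. The values below are arbitrary made-up numbers, and the `_shallow` / `_deep` suffixes are just names chosen for this illustration.

```r
# Made-up data in the same shape as above (values are arbitrary)
df_shallow <- data.frame(
  site    = c("ARN", "ARN", "DC1", "DC1"),
  yearmon = c("Jan 2003", "Apr 2003", "Jan 2003", "Feb 2003"),
  data    = c(0.04, 0.00, 0.09, 0.12)
)
df_deep <- data.frame(
  site    = c("ARN", "ARN", "DC1", "DC2"),
  yearmon = c("Jan 2003", "Mar 2003", "Jan 2003", "Feb 2003"),
  data    = c(0.02, 0.01, 0.11, 0.30)
)

# merge() defaults to an inner join: only site/yearmon pairs found in both
# data frames survive, so differing row counts are not a problem.
both <- merge(df_shallow, df_deep,
              by = c("site", "yearmon"),
              suffixes = c("_shallow", "_deep"))

# Row-wise mean of the two data columns
both$data  <- rowMeans(both[, c("data_shallow", "data_deep")])
both$depth <- "AVG"
df_avg <- both[, c("site", "depth", "yearmon", "data")]
df_avg
```

Only the two site/yearmon pairs present in both inputs (ARN and DC1 for Jan 2003) remain in `df_avg`.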

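Assuming each site and date combination is unique within both the shallow and the deep data, you can approach the problem like this (I changed the dummy data slightly):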

deep <- data.table::fread("  site depth       date data delta  mon year
ARN  Deep 2003-01-22 0.04  0.00 Jan 2003
ARN  Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN  Deep 2003-05-28 0.00  0.00 May 2003
ARN  Deep 2003-06-24 0.00  0.00 Jun 2003
ARN  Deep 2003-08-26 0.02  0.02 Aug 2003
ARN  Deep 2003-09-23 0.02  0.00 Sep 2003")

shallow <- data.table::fread("  site depth       date data delta  mon year
ARN Shallow 2003-01-22 0.06  0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05  0.00 May 2003
ARN Shallow 2003-06-24 0.00  0.00 Jun 2003
ARN Shallow 2003-08-26 0.03  0.03 Aug 2003
ARN Shallow 2003-09-23 0.03  0.00 Sep 2003")

library(dplyr)  # provides %>% and the verbs used below

df <- deep %>%
    # inner join drops all non-matching rows from both data frames
    dplyr::inner_join(shallow, by = c("site", "date"))

# not sure if you mean this result:
df %>%
    dplyr::group_by(site) %>%
    dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
    dplyr::ungroup()

# A tibble: 1 x 3
  site    deep shallow
  <chr>  <dbl>   <dbl>
1 ARN   0.0133    0.03

# or this
df %>% 
    # maybe you want to make sure there are no NAs left
    dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
    # select and calculate in the same step
    dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)

   site       date mean_data
1:  ARN 2003-01-22     0.050
2:  ARN 2003-04-28     0.005
3:  ARN 2003-05-28     0.025
4:  ARN 2003-06-24     0.000
5:  ARN 2003-08-26     0.025
6:  ARN 2003-09-23     0.025
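
To get closer to the layout requested in the question (site, depth set to "AVG", date, data, delta, yearmon), the inner join can be followed by a single `transmute()`. This is only a sketch on made-up rows. Averaging `delta` is an assumption here: the question only asks for the mean of `data`, and its example output appears to carry over the deep `delta` instead. The `_shallow` / `_deep` suffixes are names chosen for this illustration.

```r
library(dplyr)

# Made-up rows in the question's layout; DC1 has no deep record for Feb 2020
shallow <- data.frame(
  site    = c("ARN", "ARN", "DC1", "DC1"),
  depth   = "Shallow",
  date    = as.Date(c("2003-08-26", "2003-09-23", "2020-01-15", "2020-02-15")),
  data    = c(0.03, 0.03, 0.09, 0.12),
  delta   = c(0.03, 0.00, -0.03, 0.03),
  yearmon = c("Aug 2003", "Sep 2003", "Jan 2020", "Feb 2020")
)
deep <- data.frame(
  site    = c("ARN", "ARN", "DC1"),
  depth   = "Deep",
  date    = as.Date(c("2003-08-26", "2003-09-23", "2020-01-15")),
  data    = c(0.02, 0.02, 0.09),
  delta   = c(0.02, 0.00, 0.04),
  yearmon = c("Aug 2003", "Sep 2003", "Jan 2020")
)

df_avg <- shallow %>%
  # keep only rows whose site/date/yearmon exist in both data frames
  inner_join(deep, by = c("site", "date", "yearmon"),
             suffix = c("_shallow", "_deep")) %>%
  transmute(
    site,
    depth = "AVG",
    date,
    data  = (data_shallow + data_deep) / 2,
    # assumption: delta is averaged too; use delta_deep or delta_shallow
    # instead if one of them should simply be carried over
    delta = (delta_shallow + delta_deep) / 2,
    yearmon
  )
df_avg
```

The unmatched shallow row (DC1, 2020-02-15) is dropped by the join, so the result has one row per site and date that occurs in both inputs.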

