R - compare multiple columns of two dataframes
I have two datasets with different numbers of rows:
Shallow water dataset
site depth date data delta yearmon
1 ARN Shallow 2003-01-22 0.04 0.00 Jan 2003
2 ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Shallow 2003-05-28 0.00 0.00 May 2003
4 ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
5 ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
6 ARN Shallow 2003-09-23 0.03 0.00 Sep 2003
...
2946 DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947 DC1 Shallow 2020-02-15 0.12 0.03 Feb 2020
2948 DC1 Shallow 2020-03-15 0.13 0.01 Mar 2020
2949 BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950 BCI Shallow 2020-02-15 0.30 0.05 Feb 2020
2951 BCI Shallow 2020-03-15 0.33 0.03 Mar 2020
Deep water dataset
site depth date data delta yearmon
1 ARN Deep 2003-01-22 0.04 0.00 Jan 2003
2 ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Deep 2003-05-28 0.00 0.00 May 2003
4 ARN Deep 2003-06-24 0.00 0.00 Jun 2003
5 ARN Deep 2003-08-26 0.02 0.02 Aug 2003
6 ARN Deep 2003-09-23 0.02 0.00 Sep 2003
...
2578 DC1 Deep 2020-01-15 0.09 0.04 Jan 2020
2579 DC1 Deep 2020-02-15 0.12 0.03 Feb 2020
2580 DC1 Deep 2020-03-15 0.13 0.01 Mar 2020
2581 BCI Deep 2020-01-15 0.25 -0.03 Jan 2020
2582 BCI Deep 2020-02-15 0.31 0.06 Feb 2020
2583 BCI Deep 2020-03-15 0.34 0.03 Mar 2020
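There are multiple sites, and some of them are missing entries, so the rows do not line up on the same site and date across the full length of the two datasets.
I would like a separate data frame containing the mean of the data column from the two datasets, for every entry where both have the same site and date. For example: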
site depth date data delta yearmon
1 ARN AVG 2003-01-22 0.04 0.00 Jan 2003
2 ARN AVG 2003-04-28 0.00 -0.04 Apr 2003
3 ARN AVG 2003-05-28 0.00 0.00 May 2003
4 ARN AVG 2003-06-24 0.00 0.00 Jun 2003
5 ARN AVG 2003-08-26 0.025 0.02 Aug 2003
6 ARN AVG 2003-09-23 0.025 0.00 Sep 2003
...
? DC1 AVG 2020-01-15 0.09 0.04 Jan 2020
? DC1 AVG 2020-02-15 0.12 0.03 Feb 2020
? DC1 AVG 2020-03-15 0.13 0.01 Mar 2020
? BCI AVG 2020-01-15 0.25 -0.03 Jan 2020
? BCI AVG 2020-02-15 0.305 0.06 Feb 2020
? BCI AVG 2020-03-15 0.335 0.03 Mar 2020
The problem is that I do not know how to handle the different lengths. The best I could come up with is:
if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
shallow$avg <- mean(shallow$data, deep$data)
}
This obviously does not work.
Please help?
How about something like this? I made up the data just to give a simple idea:
df_shallow <- data.frame(
site = c(rep("ARN",2),rep("DC1",2)),
depth = "Shallow",
data = rnorm(4,0,0.1),
yearmon = c("Jan 2003","Apr 2003","Jan 2003","Feb 2003")
)
df_deep <- data.frame(
site = c(rep("ARN",2),c("DC1","DC2")),
depth = "Deep",
data = rnorm(4,0,0.1),
yearmon = c("Jan 2003","Mar 2003","Jan 2003","Feb 2003")
)
# install.packages("tidyverse")
library(tidyverse)
# stack the two data frames, then spread depth into one column per level
df_avg <- bind_rows(df_shallow, df_deep) %>%
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  # AVG is NA whenever either depth is missing for that site/yearmon
  mutate(AVG = (Shallow + Deep) / 2) %>%
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)
This will give you something like this:
> head(df_avg)
# A tibble: 2 × 4
site yearmon data depth
<chr> <chr> <dbl> <chr>
1 ARN Jan 2003 0.146 AVG
2 DC1 Jan 2003 -0.0998 AVG
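Of course, you can reorder the columns to match whatever order you need.
Assuming that each site and date combination is unique within both the shallow and the deep dataset, you could approach the problem like this (I changed the dummy data slightly):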
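As a small illustration of the reordering mentioned above, here is a minimal sketch using `dplyr::select()`. The `df_avg` built here is a hypothetical stand-in with fixed values (not the `rnorm()` output shown earlier), and the target order site, depth, data, yearmon is an assumption based on the column order in the question.

```r
library(dplyr)

# Hypothetical stand-in for df_avg, with fixed values instead of random ones
df_avg <- tibble(
  site    = c("ARN", "DC1"),
  yearmon = c("Jan 2003", "Jan 2003"),
  data    = c(0.15, -0.10),
  depth   = "AVG"
)

# Listing the columns in select() returns them in exactly that order
df_avg_ordered <- df_avg %>%
  select(site, depth, data, yearmon)
```

`relocate(depth, .after = site)` would be an alternative if you only want to move a single column.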
deep <- data.table::fread(" site depth date data delta mon year
ARN Deep 2003-01-22 0.04 0.00 Jan 2003
ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN Deep 2003-05-28 0.00 0.00 May 2003
ARN Deep 2003-06-24 0.00 0.00 Jun 2003
ARN Deep 2003-08-26 0.02 0.02 Aug 2003
ARN Deep 2003-09-23 0.02 0.00 Sep 2003")
shallow <- data.table::fread(" site depth date data delta mon year
ARN Shallow 2003-01-22 0.06 0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05 0.00 May 2003
ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
ARN Shallow 2003-09-23 0.03 0.00 Sep 2003")
library(dplyr)  # provides %>%
df <- deep %>%
  # inner join drops all non-matching rows from both data frames
  dplyr::inner_join(shallow, by = c("site","date"))
# not sure if you mean this result:
df %>%
  dplyr::group_by(site) %>%
  dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
  dplyr::ungroup()
# A tibble: 1 x 3
site deep shallow
<chr> <dbl> <dbl>
1 ARN 0.0133 0.03
# or this
df %>%
# maybe you want to make sure there are no NAs left
dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
# select and calculate in the same step
dplyr::transmute(site, date, mean_data = (data.x + data.y)/2)
site date mean_data
1: ARN 2003-01-22 0.050
2: ARN 2003-04-28 0.005
3: ARN 2003-05-28 0.025
4: ARN 2003-06-24 0.000
5: ARN 2003-08-26 0.025
6: ARN 2003-09-23 0.025
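If you would rather avoid extra packages, the same idea can be sketched in base R with `merge()`, shaped like the output asked for in the question (a depth column set to "AVG"). This is only a sketch under a few assumptions: the data frames below are small subsets of the dummy tables above, chosen so that one date in each has no partner, and the names `both`, `avg`, and the `_shallow`/`_deep` suffixes are made up for illustration.

```r
# Subsets of the dummy data above; 2003-05-28 exists only in deep and
# 2003-06-24 only in shallow, so neither has a match
deep <- data.frame(
  site = "ARN",
  date = as.Date(c("2003-01-22", "2003-04-28", "2003-05-28")),
  data = c(0.04, 0.00, 0.00)
)
shallow <- data.frame(
  site = "ARN",
  date = as.Date(c("2003-01-22", "2003-04-28", "2003-06-24")),
  data = c(0.06, 0.01, 0.00)
)

# merge() is an inner join by default: only site/date pairs found in both remain
both <- merge(shallow, deep, by = c("site", "date"),
              suffixes = c("_shallow", "_deep"))

# Row-wise mean of the two data columns, labelled as depth "AVG"
avg <- data.frame(
  site  = both$site,
  depth = "AVG",
  date  = both$date,
  data  = (both$data_shallow + both$data_deep) / 2
)
```

Because the join is keyed on site and date, the differing row counts of the two inputs do not matter; rows without a partner simply drop out.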