R - compare multiple columns of two dataframes
I have two datasets with different numbers of rows:
Shallow water dataset
site depth date data delta yearmon
1 ARN Shallow 2003-01-22 0.04 0.00 Jan 2003
2 ARN Shallow 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Shallow 2003-05-28 0.00 0.00 May 2003
4 ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
5 ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
6 ARN Shallow 2003-09-23 0.03 0.00 Sep 2003
...
2946 DC1 Shallow 2020-01-15 0.09 -0.03 Jan 2020
2947 DC1 Shallow 2020-02-15 0.12 0.03 Feb 2020
2948 DC1 Shallow 2020-03-15 0.13 0.01 Mar 2020
2949 BCI Shallow 2020-01-15 0.25 -0.02 Jan 2020
2950 BCI Shallow 2020-02-15 0.30 0.05 Feb 2020
2951 BCI Shallow 2020-03-15 0.33 0.03 Mar 2020
Deep water dataset
site depth date data delta yearmon
1 ARN Deep 2003-01-22 0.04 0.00 Jan 2003
2 ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
3 ARN Deep 2003-05-28 0.00 0.00 May 2003
4 ARN Deep 2003-06-24 0.00 0.00 Jun 2003
5 ARN Deep 2003-08-26 0.02 0.02 Aug 2003
6 ARN Deep 2003-09-23 0.02 0.00 Sep 2003
...
2578 DC1 Deep 2020-01-15 0.09 0.04 Jan 2020
2579 DC1 Deep 2020-02-15 0.12 0.03 Feb 2020
2580 DC1 Deep 2020-03-15 0.13 0.01 Mar 2020
2581 BCI Deep 2020-01-15 0.25 -0.03 Jan 2020
2582 BCI Deep 2020-02-15 0.31 0.06 Feb 2020
2583 BCI Deep 2020-03-15 0.34 0.03 Mar 2020
There are multiple sites, some with missing entries, so a given row number does not necessarily hold the same site and date in both datasets.
I'd like to have a separate dataframe with the averages of the data column of the two datasets, for every entry that has the same site and date in both. For example:
site depth date data delta yearmon
1 ARN AVG 2003-01-22 0.04 0.00 Jan 2003
2 ARN AVG 2003-04-28 0.00 -0.04 Apr 2003
3 ARN AVG 2003-05-28 0.00 0.00 May 2003
4 ARN AVG 2003-06-24 0.00 0.00 Jun 2003
5 ARN AVG 2003-08-26 0.025 0.02 Aug 2003
6 ARN AVG 2003-09-23 0.025 0.00 Sep 2003
...
? DC1 AVG 2020-01-15 0.09 0.04 Jan 2020
? DC1 AVG 2020-02-15 0.12 0.03 Feb 2020
? DC1 AVG 2020-03-15 0.13 0.01 Mar 2020
? BCI AVG 2020-01-15 0.25 -0.03 Jan 2020
? BCI AVG 2020-02-15 0.305 0.06 Feb 2020
? BCI AVG 2020-03-15 0.335 0.03 Mar 2020
The problem is, I have no clue how to deal with the different lengths. The best I could think of was this:
if(shallow$yearmon == deep$yearmon & shallow$site == deep$site){
shallow$avg <- mean(shallow$data, deep$data)
}
Which evidently does not work.
Please help?
How about something like this? I made up data just to give a simple idea:
df_shallow <- data.frame(
  site = c(rep("ARN", 2), rep("DC1", 2)),
  depth = "Shallow",
  data = rnorm(4, 0, 0.1),
  yearmon = c("Jan 2003", "Apr 2003", "Jan 2003", "Feb 2003")
)
df_deep <- data.frame(
  site = c(rep("ARN", 2), c("DC1", "DC2")),
  depth = "Deep",
  data = rnorm(4, 0, 0.1),
  yearmon = c("Jan 2003", "Mar 2003", "Jan 2003", "Feb 2003")
)
# install.packages("tidyverse")
library(tidyverse)
# Stack the two tables (they share the same columns), then spread depth into columns
df_avg <- bind_rows(df_shallow, df_deep) %>%
  pivot_wider(id_cols = c(site, yearmon), names_from = depth, values_from = data) %>%
  # AVG is NA wherever a site/yearmon pair is missing from one of the tables
  mutate(AVG = (Shallow + Deep) / 2) %>%
  filter(!is.na(AVG)) %>%
  select(-Shallow, -Deep) %>%
  mutate(depth = "AVG") %>%
  rename(data = AVG)
Which would give you something like this:
> head(df_avg)
# A tibble: 2 × 4
site yearmon data depth
<chr> <chr> <dbl> <chr>
1 ARN Jan 2003 0.146 AVG
2 DC1 Jan 2003 -0.0998 AVG
You could, of course, move the columns around to match whatever order you need.
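For context on why the attempt in the question fails: `if` expects a single TRUE or FALSE rather than a whole vector, comparing `shallow$site == deep$site` pairs rows by position (recycling when the lengths differ), and the second positional argument of `mean()` is `trim`, not a second vector to average with. A minimal sketch of the same idea without any packages, assuming made-up toy values and that each site/date pair appears at most once per table:

```r
# Hypothetical toy data: the two tables have different numbers of rows.
shallow <- data.frame(
  site = c("ARN", "ARN", "DC1"),
  date = as.Date(c("2003-01-22", "2003-04-28", "2020-01-15")),
  data = c(0.04, 0.00, 0.09)
)
deep <- data.frame(
  site = c("ARN", "DC1"),
  date = as.Date(c("2003-01-22", "2020-01-15")),
  data = c(0.02, 0.11)
)

# merge() performs an inner join by default: only (site, date) pairs present
# in both tables are kept, so the differing row counts no longer matter.
both <- merge(shallow, deep, by = c("site", "date"),
              suffixes = c("_shallow", "_deep"))

# Element-wise average of the two data columns, row by row.
avg <- data.frame(
  site  = both$site,
  depth = "AVG",
  date  = both$date,
  data  = (both$data_shallow + both$data_deep) / 2
)
print(avg)
```

The ARN row from 2003-04-28 has no deep counterpart in this toy data, so it does not appear in `avg`.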
Assuming that each combination of site and date is unique within shallow and within deep, you could approach the problem like this (I altered the dummy data a bit):
deep <- data.table::fread(" site depth date data delta mon year
ARN Deep 2003-01-22 0.04 0.00 Jan 2003
ARN Deep 2003-04-28 0.00 -0.04 Apr 2003
ARN Deep 2003-05-28 0.00 0.00 May 2003
ARN Deep 2003-06-24 0.00 0.00 Jun 2003
ARN Deep 2003-08-26 0.02 0.02 Aug 2003
ARN Deep 2003-09-23 0.02 0.00 Sep 2003")
shallow <- data.table::fread(" site depth date data delta mon year
ARN Shallow 2003-01-22 0.06 0.00 Jan 2003
ARN Shallow 2003-04-28 0.01 -0.02 Apr 2003
ARN Shallow 2003-05-28 0.05 0.00 May 2003
ARN Shallow 2003-06-24 0.00 0.00 Jun 2003
ARN Shallow 2003-08-26 0.03 0.03 Aug 2003
ARN Shallow 2003-09-23 0.03 0.00 Sep 2003")
library(dplyr)  # needed for %>%; the dplyr calls below are namespaced explicitly
df <- deep %>%
  # inner join drops all non-matching rows from both data frames
  dplyr::inner_join(shallow, by = c("site", "date"))
# not sure if you mean this result:
df %>%
  dplyr::group_by(site) %>%
  dplyr::summarise(deep = mean(data.x), shallow = mean(data.y)) %>%
  dplyr::ungroup()
# A tibble: 1 x 3
site deep shallow
<chr> <dbl> <dbl>
1 ARN 0.0133 0.03
# or this
df %>%
  # maybe you want to make sure there are no NAs left
  dplyr::filter(!is.na(data.x) & !is.na(data.y)) %>%
  # select and calculate in the same step
  dplyr::transmute(site, date, mean_data = (data.x + data.y) / 2)
site date mean_data
1: ARN 2003-01-22 0.050
2: ARN 2003-04-28 0.005
3: ARN 2003-05-28 0.025
4: ARN 2003-06-24 0.000
5: ARN 2003-08-26 0.025
6: ARN 2003-09-23 0.025
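To get closer to the layout asked for in the question (a depth column set to "AVG" and the averaged values in data), the join and the averaging can be combined as sketched below. The toy values and the `_shallow`/`_deep` suffixes are assumptions chosen for illustration; `anti_join()` is included only as one possible way to inspect which rows the inner join dropped.

```r
library(dplyr)

# Hypothetical toy data; one shallow row has no counterpart in deep.
shallow <- data.frame(
  site = c("ARN", "ARN", "BCI"),
  date = as.Date(c("2003-08-26", "2003-09-23", "2020-02-15")),
  data = c(0.03, 0.03, 0.30)
)
deep <- data.frame(
  site = c("ARN", "BCI"),
  date = as.Date(c("2003-08-26", "2020-02-15")),
  data = c(0.02, 0.31)
)

# Keep only site/date pairs found in both tables, then average row by row.
avg <- inner_join(shallow, deep, by = c("site", "date"),
                  suffix = c("_shallow", "_deep")) %>%
  transmute(site, depth = "AVG", date,
            data = (data_shallow + data_deep) / 2)

# Rows of shallow that have no match in deep (dropped by the inner join).
unmatched <- anti_join(shallow, deep, by = c("site", "date"))

print(avg)
print(unmatched)
```

Swapping the arguments of `anti_join()` lists the deep rows that have no shallow match.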