Is there a command to merge data sets based on observations of a variable?

I'm trying to merge two data sets, conditional on how many observations each one contains.

In other words, I have two data sets that both contain year and state. Each data set also has one additional variable: X1 in df1 and X2 in df2. I want to merge the two data sets only for states that have at least 5 observations of both X1 and X2, and for those states I want every observation included, even rows where both X1 and X2 are NA.

Is there a way to merge the data sets so that only states in which both X1 and X2 have at least 5 observations are kept? The new data set should contain all years for the states where both X1 and X2 have at least 5 observations, and all other states should be excluded.

I have experimented with inner_join(df1, df2) without success, as it only merges the year and state combinations for which both data sets have individual observations.

A reproducible example of the desired merge (for simplicity, a state is included here if it has at least 2 non-NA observations):

df1 = read.table(
  text =
    "State Year X1
A 1 NA 
A 2 NA 
A 3 5 
A 4 NA 
B 1 NA 
B 2 NA 
B 3 4 
B 4 3", header = TRUE)

df2 = read.table(
  text =
    "State Year X2
A 1 NA 
A 2 5 
A 3 7 
A 4 NA 
B 1 NA 
B 2 2 
B 3 5 
B 4 7", header = TRUE)

newdf = read.table(
  text =
    "State Year X1 X2
B 1 NA NA 
B 2 NA 2
B 3 4 5
B 4 3 7", header = TRUE)

Here, newdf drops state A because df1 has only one non-NA observation for that state, while all years are included for state B (even the first year, where both X1 and X2 are NA) because both X1 and X2 have at least 2 non-NA observations for that state. (Recall that, for simplicity, the minimum here is 2 rather than 5.)
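
To make the inclusion rule concrete, here is a minimal base R sketch that counts the non-NA observations per state in each data set. The data frames are rebuilt with data.frame() so the snippet runs on its own, and min_obs and keep_states are names introduced here for illustration only, not part of the original example.

```r
# Example data from above, rebuilt so this block is self-contained.
df1 <- data.frame(
  State = rep(c("A", "B"), each = 4),
  Year  = rep(1:4, times = 2),
  X1    = c(NA, NA, 5, NA, NA, NA, 4, 3)
)
df2 <- data.frame(
  State = rep(c("A", "B"), each = 4),
  Year  = rep(1:4, times = 2),
  X2    = c(NA, 5, 7, NA, NA, 2, 5, 7)
)

# Illustrative threshold (assumption: 2 for the toy data, 5 in the real case).
min_obs <- 2

# Number of non-NA observations per state in each data set.
n_x1 <- tapply(!is.na(df1$X1), df1$State, sum)  # A = 1, B = 2
n_x2 <- tapply(!is.na(df2$X2), df2$State, sum)  # A = 2, B = 3

# States that reach the threshold in both data sets.
keep_states <- intersect(names(n_x1)[n_x1 >= min_obs],
                         names(n_x2)[n_x2 >= min_obs])
keep_states  # "B"
```

State A fails only because of X1 (one non-NA value), which is why it should be excluded entirely even though X2 alone would qualify.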

You need to do further filtering after merging: group by State and keep only the groups in which both X1 and X2 have enough non-NA values (2 in this example).

library(dplyr)

inner_join(df1, df2, by = c("State", "Year")) %>%
  group_by(State) %>%
  filter(if_all(X1:X2, ~ sum(!is.na(.x)) >= 2)) %>%
  ungroup()

# # A tibble: 4 × 4
#   State  Year X1    X2   
#   <chr> <int> <chr> <chr>
# 1 B         1 NA    NA   
# 2 B         2 NA    2    
# 3 B         3 4     5    
# 4 B         4 3     7
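
If the threshold needs to change (5 in the real data), the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: merge_by_min_obs and min_obs are hypothetical names, the columns are called State, Year, X1 and X2 as in the example, and full_join() is swapped in for inner_join() so that a year present in only one data set is still kept for a qualifying state. When both data sets cover the same state-year pairs, as in the example, the two joins return the same rows.

```r
library(dplyr)

# Example data from the question, rebuilt so this block is self-contained.
df1 <- data.frame(
  State = rep(c("A", "B"), each = 4),
  Year  = rep(1:4, times = 2),
  X1    = c(NA, NA, 5, NA, NA, NA, 4, 3)
)
df2 <- data.frame(
  State = rep(c("A", "B"), each = 4),
  Year  = rep(1:4, times = 2),
  X2    = c(NA, 5, 7, NA, NA, 2, 5, 7)
)

# Hypothetical helper: keep every row of the states that have at least
# `min_obs` non-NA values in both X1 and X2.
merge_by_min_obs <- function(a, b, min_obs = 5) {
  full_join(a, b, by = c("State", "Year")) %>%
    group_by(State) %>%
    # Each condition is a single TRUE/FALSE per state, so a state is
    # either kept with all of its years or dropped completely.
    filter(sum(!is.na(X1)) >= min_obs,
           sum(!is.na(X2)) >= min_obs) %>%
    ungroup() %>%
    arrange(State, Year)
}

res <- merge_by_min_obs(df1, df2, min_obs = 2)  # state B, all four years
none <- merge_by_min_obs(df1, df2)              # default of 5: no state qualifies here
```

With the toy data, a threshold of 2 returns the four rows for state B, while the default threshold of 5 returns an empty result because no state has that many non-NA values.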
