
R dplyr full_join - no common key, need common columns to blend together

For example, I have these two data frames:

dates = c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'), 
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))

df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'), 
                  rating = c(1, 2, 1))
> df1
       dates    area rating rating2
1 2020-11-19   paris     10       5
2 2020-11-20  london      5       6
3 2020-11-21 newyork      6       7
> df2
       dates     area rating
1 2020-11-19 budapest      1
2 2020-11-20   moscow      2
3 2020-11-21 valencia      1

When performing an outer join using dplyr:

library(dplyr)  # provides %>% and full_join()

df <- df1 %>%
  full_join(df2, by = c('dates', 'area'))

the result looks like this:

       dates     area rating.x rating2 rating.y
1 2020-11-19    paris       10       5       NA
2 2020-11-20   london        5       6       NA
3 2020-11-21  newyork        6       7       NA
4 2020-11-19 budapest       NA      NA        1
5 2020-11-20   moscow       NA      NA        2
6 2020-11-21 valencia       NA      NA        1

That is, the rating columns from the two data frames are not blended together; two separate columns are created instead.

How do I get a result like this?

       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

What you're looking for is dplyr::bind_rows(), which preserves common columns and fills in NA for columns that exist in only one of the data frames:

> bind_rows(df1, df2)
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
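
If it matters which data frame each row came from, bind_rows() also accepts an .id argument. The following is a minimal sketch, assuming the df1 and df2 from the question; the list names "df1" and "df2" and the column name "source" are illustrative choices, not part of the original post:

```r
library(dplyr)

# Data frames as defined in the question
dates <- c("2020-11-19", "2020-11-20", "2020-11-21")
df1 <- data.frame(dates, area = c("paris", "london", "newyork"),
                  rating = c(10, 5, 6), rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating = c(1, 2, 1))

# Passing a named list together with .id stores each list name
# in a new "source" column (column name chosen for illustration)
stacked <- bind_rows(list(df1 = df1, df2 = df2), .id = "source")
print(stacked)
```

The rows are stacked exactly as before; the only difference is the extra column recording the origin of each row.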

Note that you could also keep using full_join(), but if you don't want columns to be split you must include every column the two data frames have in common as a key:

> full_join(
+   df1, df2,
+   by = c("dates", "area", "rating")
+ )
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

The documentation for dplyr joins mentions:

Output columns include all x columns and all y columns. If columns in x and y have the same name (and aren't included in by), suffixes are added to disambiguate.
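
If rating genuinely has to stay out of the keys, the suffixed columns can still be blended after the join. This is a sketch under the assumption that at most one of the two suffixed values is non-NA per row, as in the question's data; the suffixes "_df1" and "_df2" are illustrative choices, and coalesce() is used here as one possible way to merge them, not something the original answer proposed:

```r
library(dplyr)

# Data frames as defined in the question
dates <- c("2020-11-19", "2020-11-20", "2020-11-21")
df1 <- data.frame(dates, area = c("paris", "london", "newyork"),
                  rating = c(10, 5, 6), rating2 = c(5, 6, 7))
df2 <- data.frame(dates, area = c("budapest", "moscow", "valencia"),
                  rating = c(1, 2, 1))

# Join on dates and area only, so rating is split into two suffixed columns
joined <- full_join(df1, df2, by = c("dates", "area"),
                    suffix = c("_df1", "_df2"))

# coalesce() takes the first non-NA value per row, blending the two columns
blended <- joined %>%
  mutate(rating = coalesce(rating_df1, rating_df2)) %>%
  select(dates, area, rating, rating2)
print(blended)
```

When both suffixed columns hold a value for the same row, coalesce() keeps the first one, so the argument order decides which data frame wins.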

You could also avoid this issue by not specifying by at all, in which case dplyr joins on all common columns.

> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

As far as I know, both methods work for your use case. In fact, I believe the practical advantage full_join() has over bind_rows() is precisely the behaviour you wanted to avoid here, i.e. splitting columns that aren't keys.
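
One further difference may be worth keeping in mind when the inputs can overlap. The sketch below uses two hypothetical data frames, a and b, which are not from the question: a row that appears in both inputs is matched and emitted once by full_join() on all common columns, whereas bind_rows() simply stacks both copies.

```r
library(dplyr)

# Hypothetical data: the row (key = 2, value = "y") appears in both frames
a <- data.frame(key = c(1, 2), value = c("x", "y"))
b <- data.frame(key = c(2, 3), value = c("y", "z"))

# Joining on every common column matches the shared row, so it appears once
joined <- full_join(a, b, by = c("key", "value"))

# Stacking keeps both copies of the shared row
stacked <- bind_rows(a, b)

print(nrow(joined))
print(nrow(stacked))
```

With the question's data no row is shared between df1 and df2, which is why both approaches give the same six rows there.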
