Dealing with duplicated columns using full_join in R

Good afternoon!

I'm currently working on a data manipulation task in R and have run into a dilemma.

I have two tables, and my goal is to join them using specific keys.

Table1:

Name <- c("John", "Michael", "Anna", "Boris")
ID <- c("ID1", "ID2", "ID3", "ID4")
PDN <- c(40, 10, 6, 70)
Sum3107 <- c(16, 10, 53, 44)
Sum3108 <- c(16, 8, 50, 43)

table1 <- data.frame(Name, ID, PDN, Sum3107, Sum3108)

And Table2:

Name <- c("Martin", "Anna", "Olga", "Boris")
ID <- c("ID6", "ID3", "ID7", "ID4")
PDN <- c(22, 6, 44, 70)
Sum3009 <- c(10, 8, 45, 30)
Sum3110 <- c(9, 6, 30, 20)

table2 <- data.frame(Name, ID, PDN, Sum3009, Sum3110)

I've opted for full_join (from dplyr), as in theory it solves the task perfectly:

library(dplyr)

table3 <- full_join(table1, table2, by = c("Name", "ID", "PDN"))

Everything is correct here, because all the columns that appear in both tables are used as keys.

But if I use only some of those columns as keys in the full_join, R duplicates the remaining columns that appear in both tables, which is not what I expect.

table3 <- full_join(table1, table2, by = c("Name", "ID")) #"PDN" was removed
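
To make the duplication concrete, here is a minimal self-contained sketch that rebuilds the two tables from the question and inspects the column names after joining on only Name and ID. The variable name joined is just for illustration. With dplyr's default suffixes, the shared non-key column PDN comes back as PDN.x and PDN.y.

```r
library(dplyr)

# The two example tables from the question
table1 <- data.frame(
  Name    = c("John", "Michael", "Anna", "Boris"),
  ID      = c("ID1", "ID2", "ID3", "ID4"),
  PDN     = c(40, 10, 6, 70),
  Sum3107 = c(16, 10, 53, 44),
  Sum3108 = c(16, 8, 50, 43)
)
table2 <- data.frame(
  Name    = c("Martin", "Anna", "Olga", "Boris"),
  ID      = c("ID6", "ID3", "ID7", "ID4"),
  PDN     = c(22, 6, 44, 70),
  Sum3009 = c(10, 8, 45, 30),
  Sum3110 = c(9, 6, 30, 20)
)

# PDN is in both tables but is not a key, so it gets suffixed
joined <- full_join(table1, table2, by = c("Name", "ID"))
names(joined)
# "Name" "ID" "PDN.x" "Sum3107" "Sum3108" "PDN.y" "Sum3009" "Sum3110"
```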

Is it possible to join on specific columns, rather than on every column shared by the two tables, without getting duplicated columns in the result?

Expected result: a full join of the two tables using only two keys (c("Name", "ID")), where the "PDN" column appears once in the result (no PDN.x and PDN.y).

Thank you in advance! Any help is highly appreciated!

Does this help? It gives roughly the same output as the full join, in a different order. I'm not specifying PDN, but I am specifying the columns I want to sum, which excludes PDN.

library(dplyr)

# PDN is neither grouped nor summarised, so it is dropped; all-NA sums become 0
bind_rows(table1, table2) %>%
  group_by(Name, ID) %>%
  summarise(across(contains("Sum"), ~sum(.x, na.rm = TRUE)), .groups = "drop")

I can't yet think of a way to make R treat the PDN column differently from the Sum columns without giving it some indication that PDN should be treated like a key and/or the others should be treated like values.
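
One way to give R that indication explicitly is to join on the two keys and then merge the suffixed PDN columns by hand. The sketch below is an illustration under an assumption, not something taken from the question: it assumes PDN agrees wherever a row exists in both tables (as it does in the sample data), so taking the first non-missing value with coalesce() is safe. The name result is illustrative.

```r
library(dplyr)

# The two example tables from the question
table1 <- data.frame(
  Name    = c("John", "Michael", "Anna", "Boris"),
  ID      = c("ID1", "ID2", "ID3", "ID4"),
  PDN     = c(40, 10, 6, 70),
  Sum3107 = c(16, 10, 53, 44),
  Sum3108 = c(16, 8, 50, 43)
)
table2 <- data.frame(
  Name    = c("Martin", "Anna", "Olga", "Boris"),
  ID      = c("ID6", "ID3", "ID7", "ID4"),
  PDN     = c(22, 6, 44, 70),
  Sum3009 = c(10, 8, 45, 30),
  Sum3110 = c(9, 6, 30, 20)
)

result <- full_join(table1, table2, by = c("Name", "ID")) %>%
  # Assumption: PDN.x and PDN.y never conflict, so take the first non-NA
  mutate(PDN = coalesce(PDN.x, PDN.y)) %>%
  select(-PDN.x, -PDN.y) %>%
  # Put the merged column back next to the keys
  relocate(PDN, .after = ID)

result
```

If the two tables could disagree on PDN for the same Name and ID, coalesce() would silently keep the value from table1, so it may be worth checking for conflicts before dropping PDN.y.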


Edit - This isn't elegant, but another approach would be to do your desired join and then "fix it in post." Here that is done by reshaping to long format, removing any ".x" or ".y" suffix from the column names, keeping the first non-NA value for each name, and then pivoting wide again.

But this is definitely worse. :-)

library(dplyr)
library(tidyr)
library(stringr)

full_join(table1, table2, by = c("Name", "ID")) %>%
  pivot_longer(-c(Name, ID)) %>%
  # strip the literal ".x"/".y" suffix; an unescaped "." would match any character
  mutate(name = str_remove(name, "\\.[xy]$")) %>%
  filter(!is.na(value)) %>%
  group_by(Name, ID, name) %>% slice(1) %>% ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

# A tibble: 6 x 7
  Name    ID      PDN Sum3009 Sum3107 Sum3108 Sum3110
  <chr>   <chr> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Anna    ID3       6       8      53      50       6
2 Boris   ID4      70      30      44      43      20
3 John    ID1      40      NA      16      16      NA
4 Martin  ID6      22      10      NA      NA       9
5 Michael ID2      10      NA      10       8      NA
6 Olga    ID7      44      45      NA      NA      30
