Sort values across multiple columns in R with dplyr

Apologies for the not-particularly-clear title; hopefully my example below helps. I am working with some sports data, attempting to compute "lineup statistics" for certain groupings of players in the data. Below is an example of the type of data I'm working with (playerInfo), as well as the type of analysis I am attempting to do (groupedInfo):

library(dplyr)  # provides %>% and n(), which are used unqualified below

playerInfo = data.frame(
  lineup = c(1,2,3,4,5,6),
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player1id = c("e91", "a27", "a27", "b17", "b17", "3b3"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player2id = c("b17", "e91", "b17", "3b3", "a27", "a27"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  player3id = c("3b3", "3b3", "3b3", "a27", "3b3", "b17"),
  points = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

groupedInfo <- playerInfo %>%
  dplyr::group_by(player1, player2, player3) %>%
  dplyr::summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

> groupedInfo
# A tibble: 6 x 5
# Groups:   player1, player2 [?]
  player1 player2 player3 lineup_ct total_pts
  <chr>   <chr>   <chr>       <int>     <dbl>
1 Bil     Nik     Joe             1         6
2 Joe     Tom     Nik             1         2
3 Nik     Joe     Tom             1        12
4 Nik     Tom     Joe             1        36
5 Tom     Bil     Joe             1         8
6 Tom     Nik     Joe             1         3

The goal here is to group_by the 3 players in each row, and then compute some summary statistics (in this simple example, count and sum of points) for the different groups. Unfortunately, what dplyr::group_by misses is that certain groups of players should be treated as the same group if they contain the same 3 players, simply in different columns.

For example, in the dataframe above, rows 3, 4, 5 and 6 all have the same 3 players (Nik, Tom, Joe). However, because sometimes Nik is player1, sometimes Nik is player2, etc., group_by groups them separately.

For clarity, below is an example of the type of results I am seeking:

correctPlayerInfo = data.frame(
  lineup = c(1,2,3,4,5,6),
  player1 = c("Bil", "Bil", "Joe", "Joe", "Joe", "Joe"),
  player1id = c("e91", "e91", "3b3", "3b3", "3b3", "3b3"),
  player2 = c("Joe", "Joe", "Nik", "Nik", "Nik", "Nik"),
  player2id = c("3b3", "3b3", "b17", "b17", "b17", "b17"),
  player3 = c("Nik", "Tom", "Tom", "Tom", "Tom", "Tom"),
  player3id = c("b17", "a27", "a27", "a27", "a27", "a27"),
  points = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

correctGroupedInfo <- correctPlayerInfo %>%
  dplyr::group_by(player1, player2, player3) %>%
  dplyr::summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

> correctGroupedInfo
# A tibble: 3 x 5
# Groups:   player1, player2 [?]
  player1 player2 player3 lineup_ct total_pts
  <chr>   <chr>   <chr>       <int>     <dbl>
1 Bil     Joe     Nik             1         6
2 Bil     Joe     Tom             1         8
3 Joe     Nik     Tom             4        53

In this second example, I have manually sorted the data alphabetically such that player1 < player2 < player3. As a result, when I do the group_by, it correctly groups rows 3-6 into a single group.

How can I achieve this programmatically? I'm not sure whether it is best to (a) restructure playerInfo into the column-sorted correctPlayerInfo (as I've done above), or (b) use some other approach where group_by automatically identifies that these are the same groups.

I am actively working on this, and will post updates if I come up with my own solution. Until then, any help with this is greatly appreciated!

Edit: Thus far I've tried something along these lines:

newPlayerInfo <- playerInfo %>%
  dplyr::mutate(newPlayer1 = min(player1, player2, player3)) %>%
  dplyr::mutate(newPlayer3 = max(player1, player2, player3))

... to no avail.
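A likely reason the attempt above does not work: min() and max() reduce all of their arguments to a single value, so inside mutate() that one value is recycled to every row. The element-wise counterparts in base R are pmin() and pmax(). The sketch below uses toy vectors copied from the first three rows of playerInfo; the variable names overall_min, row_min and row_max are made up for illustration.

```r
# Toy vectors copied from the first three rows of playerInfo
player1 <- c("Bil", "Tom", "Tom")
player2 <- c("Nik", "Bil", "Nik")
player3 <- c("Joe", "Joe", "Joe")

# min() collapses all three vectors into one value for the whole data set
overall_min <- min(player1, player2, player3)

# pmin()/pmax() compare element by element, giving one value per row
row_min <- pmin(player1, player2, player3)
row_max <- pmax(player1, player2, player3)

print(overall_min)  # "Bil"
print(row_min)      # "Bil" "Bil" "Joe"
print(row_max)      # "Nik" "Tom" "Tom"
```

Note that pmin() and pmax() only give the first and last name of each row; the middle player still has to be derived separately.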

You could create group IDs that are sorted composites of the players' names (or IDs). For example:

library(dplyr)  # mutate(), group_by(), summarise(), n() and %>%

playerInfo %>%
  mutate(
    group_id = purrr::pmap_chr(
      .l = list(p1 = player1, p2 = player2, p3 = player3),
      .f = function(p1, p2, p3) paste(sort(c(p1, p2, p3)), collapse = "_")
    )
  ) %>% 
  group_by(group_id) %>% 
  summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

# A tibble: 3 x 3
  group_id    lineup_ct total_pts
  <chr>           <int>     <dbl>
1 Bil_Joe_Nik         1         6
2 Bil_Joe_Tom         1         8
3 Joe_Nik_Tom         4        53
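If you would rather keep three separate player columns, as in option (a) of the question, one possible sketch is to sort the names within each row and write them back before grouping. This is an alternative to the composite ID above, not part of it. The names player_cols, sorted_names, sortedPlayerInfo and sortedGroupedInfo are made up for illustration, the .groups argument assumes dplyr 1.0 or later, and the ID columns are left out for brevity (they would need to be reordered alongside the names to stay aligned).

```r
library(dplyr)

# Same lineups as in the question, without the ID columns
playerInfo <- data.frame(
  lineup  = 1:6,
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  points  = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

player_cols <- c("player1", "player2", "player3")

# Sort the names within each row. apply() returns one column per input row,
# so transpose to get back one row per lineup, then drop the dimnames.
sorted_names <- t(apply(playerInfo[, player_cols], 1, sort))
dimnames(sorted_names) <- NULL

# Write the sorted names back column by column
sortedPlayerInfo <- playerInfo
sortedPlayerInfo$player1 <- sorted_names[, 1]
sortedPlayerInfo$player2 <- sorted_names[, 2]
sortedPlayerInfo$player3 <- sorted_names[, 3]

sortedGroupedInfo <- sortedPlayerInfo %>%
  group_by(player1, player2, player3) %>%
  summarise(
    lineup_ct = n(),
    total_pts = sum(points),
    .groups = "drop"
  )

print(sortedGroupedInfo)
```

With this data the result has three rows, matching correctGroupedInfo from the question: Bil/Joe/Nik (1 lineup, 6 points), Bil/Joe/Tom (1 lineup, 8 points) and Joe/Nik/Tom (4 lineups, 53 points).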
