
Combine/sort columns with dplyr and/or tidyr

EDIT: I've tried a solution below, but as I need to convert the factors to characters and back to factors, I lose some important information.

Having this table, I want it to be sorted from this,

From    To  count
A       B     2
A       C     1
C       A     3
B       C     1

to this,

  From To count
1    A  B     2
2    A  C     4
3    B  C     1

So far I see two options, either to do this:

df[1:2] <- t(apply(df[1:2], 1, sort))    
aggregate(count ~ From + To, df, sum)

which is quite slow as I'm working with 9,000,000 observations. Or I can simply convert this into an igraph network and merge the edges:

g <- graph_from_data_frame(df, directed = TRUE, vertices = nodes)
g <- as.undirected(g, mode = "mutual", edge.attr.comb=list(weight = "sum"))

I have two problems. First, the first option I've mentioned should actually use dplyr or tidyr, but I couldn't figure out how to do it so far.

Second, the network/igraph option is quicker than the "t(apply(" option, but I still need to convert the graph back to a data.table for further analysis.

Any idea on how to run the "t(apply(" option with dplyr or tidyr?

In base R, we can combine akrun's pmin and pmax suggestion with aggregate using the non-formula interface as follows:

aggregate(df$count, list(From=pmin(df$From, df$To), To=pmax(df$From, df$To)), sum)
  From To x
1    A  B 2
2    A  C 4
3    B  C 1

Note that this requires that df$From and df$To are character vectors, not factors.
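
If the columns are factors, one possible workaround is to convert them to character for the pmin/pmax step and then rebuild factors on a shared set of levels afterwards. The following is only a sketch under that assumption; the names df_f, from_chr, to_chr and all_levels are made up for illustration, and using the sorted union of both level sets is my assumption about which levels should be kept.

```r
# Hypothetical factor version of the small example data
df_f <- data.frame(From  = factor(c("A", "A", "C", "B")),
                   To    = factor(c("B", "C", "A", "C")),
                   count = c(2L, 1L, 3L, 1L))

# pmin/pmax need character input, so convert temporarily
from_chr <- as.character(df_f$From)
to_chr   <- as.character(df_f$To)

# Keep every level that occurs in either column (an assumption)
all_levels <- sort(union(levels(df_f$From), levels(df_f$To)))

res <- aggregate(df_f$count,
                 list(From = pmin(from_chr, to_chr),
                      To   = pmax(from_chr, to_chr)),
                 sum)

# Rebuild factors so both columns share the same levels
res$From <- factor(res$From, levels = all_levels)
res$To   <- factor(res$To,   levels = all_levels)
res
```

Both result columns then carry the same levels, even if a level no longer occurs in one of them after the within-row sort.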

timings
This method is faster than using apply as it doesn't involve conversion to matrices. Using the larger data set below, with 9 million observations, the time to completion using pmin and pmax with aggregate was 14.5 seconds on my computer, whereas the OP's method with apply took 442.2 seconds, or about 30 times longer.
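
A comparison along these lines can be set up with system.time. This is a sketch only: it uses a much smaller sample (n = 1e5, my choice so that it finishes quickly) and the names small, sorted_df, res_apply and res_pmin are hypothetical. Absolute timings will differ by machine and data size.

```r
set.seed(1234)
n <- 1e5  # assumed size, far below the 9 million rows discussed above
small <- data.frame(From  = sample(LETTERS, n, replace = TRUE),
                    To    = sample(LETTERS, n, replace = TRUE),
                    count = sample(100, n, replace = TRUE),
                    stringsAsFactors = FALSE)

# OP's approach: sort each row via apply, then aggregate with the formula interface
time_apply <- system.time({
  sorted_df <- small
  sorted_df[1:2] <- t(apply(sorted_df[1:2], 1, sort))
  res_apply <- aggregate(count ~ From + To, sorted_df, sum)
})

# pmin/pmax approach with the non-formula interface
time_pmin <- system.time({
  res_pmin <- aggregate(small$count,
                        list(From = pmin(small$From, small$To),
                             To   = pmax(small$From, small$To)),
                        sum)
})

print(time_apply["elapsed"])
print(time_pmin["elapsed"])
```

Both calls should produce the same sums per pair; only the summed column is named differently (count versus x).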

data

df <-
structure(list(From = c("A", "A", "C", "B"), To = c("B", "C", 
"A", "C"), count = c(2L, 1L, 3L, 1L)), .Names = c("From", "To", 
"count"), class = "data.frame", row.names = c(NA, -4L))

larger sample data

set.seed(1234)
df <- data.frame(From=sample(LETTERS, 9e6, replace=TRUE),
                 To=sample(LETTERS, 9e6, replace=TRUE),
                 count=sample(100, 9e6, replace=TRUE),
                 stringsAsFactors=FALSE)

We can use pmin/pmax, which should be faster:

library(dplyr)
df %>% 
    group_by(From1 = pmin(From, To), To = pmax(From, To)) %>% 
    summarise(count = sum(count)) %>%
    rename(From = From1)
#  From    To count
#  <chr> <chr> <int>
#1     A     B     2
#2     A     C     4
#3     B     C     1

library(tidyverse)
cols_before_merge <- c("From", "To")
out_cols <- c("col_1", "col_2")

df <- tibble::tribble(
  ~From, ~To, ~count,
  "A", "B", 2,
  "A", "C", 1,
  "C", "A", 3,
  "B", "C", 1,
)

With the above, I think the tidyverse approach to creating the unique keys would be:

df_out <- df %>%
  dplyr::mutate(
    key = purrr::pmap_chr(
      list(From, To),
      ~ stringr::str_c(stringr::str_sort(c(...)), collapse = "_")
    )
  )

Or, for a more programmatic approach using tidy evaluation:

merge_sort <- function(cols_values) {
  purrr::pmap_chr(
    cols_values,
    ~ stringr::str_c(stringr::str_sort(c(...)), collapse = "_")
  )
}

add_key <- function(cols) {
  # column names need to be evaluated using the dataframe as an environment
  cols_quosure <- rlang::enquo(cols)

  # column names should be symbols not strings
  cols_syms <- rlang::syms(cols)

  cols_values <- purrr::map(
    cols_syms,
    ~ rlang::eval_tidy(.x, rlang::quo_get_env(cols_quosure))
  )

  merge_sort(cols_values)
}



# Adding columns for key construction programmatically
df_out <- df %>%
  dplyr::mutate(key = add_key(cols_before_merge))

And finally, to get the count per key and make sure the output columns are factors (as akrun points out, the factor levels before and after the within-row sorting could very easily differ):

df_out %>%
  # wt = count sums the count column per key; without wt, count() only tallies rows
  dplyr::count(key, wt = count, name = "count") %>%
  tidyr::separate(key, sep = "_", into = out_cols) %>%
  dplyr::mutate_at(out_cols, as.factor)
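
For comparison, the same key-then-split idea can be sketched in base R. This is not the approach above, just an illustration of the difference between the number of rows per key and the summed count per key. The names key, n_rows, totals, parts and lvls are hypothetical, and the sketch assumes that "_" never appears in a From or To value.

```r
df <- data.frame(From  = c("A", "A", "C", "B"),
                 To    = c("B", "C", "A", "C"),
                 count = c(2, 1, 3, 1),
                 stringsAsFactors = FALSE)

# Order-independent key: the smaller label always comes first
key <- paste(pmin(df$From, df$To), pmax(df$From, df$To), sep = "_")

n_rows <- table(key)                  # how many rows share each key
totals <- tapply(df$count, key, sum)  # summed count per key

# Split the key back into two columns with shared factor levels
parts <- do.call(rbind, strsplit(names(totals), "_", fixed = TRUE))
lvls  <- sort(unique(c(df$From, df$To)))
out <- data.frame(col_1 = factor(parts[, 1], levels = lvls),
                  col_2 = factor(parts[, 2], levels = lvls),
                  count = as.vector(totals))
out
```

Here the A/C pair occurs in two rows, while its summed count is 4, which is the value shown in the question's desired output.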
