Combine/sort columns with dplyr and/or tidyr

EDIT: I've tried a solution below, but as I need to convert the factors to characters and back to factors, I lose some important information.

Given this table, I want to go from this,

From    To  count
A       B     2
A       C     1
C       A     3
B       C     1

to this,

  From To count
1    A  B     2
2    A  C     4
3    B  C     1

So far I see two options. One is to do this:

df[1:2] <- t(apply(df[1:2], 1, sort))    
aggregate(count ~ From + To, df, sum)

which is quite slow as I'm working with 9,000,000 observations. The other is to convert this into an igraph network and merge the edges:

g <- graph_from_data_frame(df, directed = TRUE, vertices = nodes)
g <- as.undirected(g, mode = "mutual", edge.attr.comb=list(weight = "sum"))

I have two problems. First, the first option should actually use dplyr or tidyr, but I couldn't figure out how to do that so far.

Second, the network/igraph option is quicker than the "t(apply(" option, but I still need to convert the graph back to a data.table for further analysis.

Any idea on how to run the "t(apply(" option with dplyr or tidyr?

In base R, we can combine akrun's pmin and pmax suggestion with aggregate, using the non-formula interface, as follows:

aggregate(df$count, list(From=pmin(df$From, df$To), To=pmax(df$From, df$To)), sum)
  From To x
1    A  B 2
2    A  C 4
3    B  C 1

Note that this requires that df$From and df$To are character vectors, not factors.
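If the columns are stored as factors, one possible workaround is to remember the factor levels, run pmin/pmax on character copies, and rebuild the factors afterwards. The following is only a sketch under that assumption; the names df_f, all_levels, from_chr, to_chr and res are illustrative and not part of the original answer.

```r
# Hypothetical factor version of the small example data
df_f <- data.frame(
  From  = factor(c("A", "A", "C", "B")),
  To    = factor(c("B", "C", "A", "C")),
  count = c(2L, 1L, 3L, 1L)
)

# Remember every level that occurs in either column
all_levels <- union(levels(df_f$From), levels(df_f$To))

# pmin/pmax are applied to character copies of the factor columns
from_chr <- as.character(df_f$From)
to_chr   <- as.character(df_f$To)

res <- aggregate(
  df_f$count,
  list(From = pmin(from_chr, to_chr), To = pmax(from_chr, to_chr)),
  sum
)
names(res)[3] <- "count"

# Rebuild both columns as factors sharing the same set of levels
res$From <- factor(res$From, levels = all_levels)
res$To   <- factor(res$To, levels = all_levels)
```

Giving both columns the same levels keeps them comparable after the within-row sort, although any other information attached to the original factors (for example a custom level order) would have to be restored by hand.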

timings
This method is faster than using apply as it doesn't involve conversion to matrices. Using the larger data set below, with 9 million observations, the time to completion using pmin and pmax with aggregate was 14.5 seconds on my computer, whereas the OP's method with apply took 442.2 seconds, or about 30 times longer.
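A comparison of this kind can be reproduced on a smaller sample with system.time. The sketch below is illustrative only: the sample size n and the names small, res_pm and res_ap are assumptions, and the timings will differ by machine. It also checks that both methods produce the same totals.

```r
set.seed(1234)
n <- 1e4  # assumed small sample size so the apply version finishes quickly
small <- data.frame(From  = sample(LETTERS, n, replace = TRUE),
                    To    = sample(LETTERS, n, replace = TRUE),
                    count = sample(100, n, replace = TRUE),
                    stringsAsFactors = FALSE)

# pmin/pmax with the non-formula interface of aggregate
t_pm <- system.time(
  res_pm <- aggregate(small$count,
                      list(From = pmin(small$From, small$To),
                           To   = pmax(small$From, small$To)),
                      sum)
)

# Row-wise sort with apply, followed by the formula interface
t_ap <- system.time({
  tmp <- small
  tmp[1:2] <- t(apply(tmp[1:2], 1, sort))
  res_ap <- aggregate(count ~ From + To, tmp, sum)
})

# Put both results in the same row order before comparing them
res_pm <- res_pm[order(res_pm$From, res_pm$To), ]
res_ap <- res_ap[order(res_ap$From, res_ap$To), ]

print(rbind(pmin_pmax = t_pm, apply_sort = t_ap))
```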

data

df <-
structure(list(From = c("A", "A", "C", "B"), To = c("B", "C", 
"A", "C"), count = c(2L, 1L, 3L, 1L)), .Names = c("From", "To", 
"count"), class = "data.frame", row.names = c(NA, -4L))

larger sample data

set.seed(1234)
df <- data.frame(From=sample(LETTERS, 9e6, replace=TRUE),
                 To=sample(LETTERS, 9e6, replace=TRUE),
                 count=sample(100, 9e6, replace=TRUE),
                 stringsAsFactors=FALSE)

We can use pmin/pmax, which should be faster.

library(dplyr)
# df is the data frame defined in the "data" section above
df %>% 
    group_by(From1 = pmin(From, To), To = pmax(From, To)) %>% 
    summarise(count = sum(count)) %>%
    rename(From = From1)
#  From    To count
#  <chr> <chr> <int>
#1     A     B     2
#2     A     C     4
#3     B     C     1

library(tidyverse)
cols_before_merge <- c("From", "To")
out_cols <- c("col_1", "col_2")

df <- tibble::tribble(
  ~From, ~To, ~count,
  "A", "B", 2,
  "A", "C", 1,
  "C", "A", 3,
  "B", "C", 1,
)

With the above, I think the tidyverse approach to creating the unique keys would be:

df_out <- df %>%
  dplyr::mutate(
    key = purrr::pmap_chr(
      list(From, To),
      ~ stringr::str_c(stringr::str_sort(c(...)), collapse = "_")
    )
  )
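Because pmap_chr works row by row, it may be slow on millions of rows. When exactly two columns are involved, the same key can probably be built in a vectorized way with pmin/pmax and paste. This is a sketch under that two-column assumption; df_small and totals are illustrative names, not part of the original answer.

```r
# Small example data, kept separate from df so the sketch is self-contained
df_small <- data.frame(From  = c("A", "A", "C", "B"),
                       To    = c("B", "C", "A", "C"),
                       count = c(2, 1, 3, 1),
                       stringsAsFactors = FALSE)

# Vectorized key: the smaller value first, the larger second, joined by "_"
df_small$key <- paste(pmin(df_small$From, df_small$To),
                      pmax(df_small$From, df_small$To),
                      sep = "_")

# Sum the counts per key
totals <- aggregate(count ~ key, df_small, sum)
```

Unlike the pmap_chr version, this does not generalize directly to more than two columns, so the row-wise approach remains the more flexible one.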

Or, for a more programmatic approach using tidy evaluation:

merge_sort <- function(cols_values) {
  purrr::pmap_chr(
    cols_values,
    ~ stringr::str_c(stringr::str_sort(c(...)), collapse = "_")
  )
}

add_key <- function(cols) {
  # column names need to be evaluated using the dataframe as an environment
  cols_quosure <- rlang::enquo(cols)

  # column names should be symbols not strings
  cols_syms <- rlang::syms(cols)

  cols_values <- purrr::map(
    cols_syms,
    ~ rlang::eval_tidy(.x, rlang::quo_get_env(cols_quosure))
  )

  merge_sort(cols_values)
}



# Adding columns for key construction programmatically
df_out <- df %>%
  dplyr::mutate(key = add_key(cols_before_merge))

And finally, to get a count per key and make sure the output columns are factors (as akrun points out, the factor levels before and after within-row sorting could very easily be different):

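df_out %>%
  # weight by the original count column so the counts are summed per key,
  # rather than counting the number of rows per key
  dplyr::count(key, wt = count, name = "count") %>%
  tidyr::separate(key, sep = "_", into = out_cols) %>%
  dplyr::mutate_at(out_cols, as.factor)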
