How to count all common occurrences of one factor between pairwise combinations of another factor's levels (preferably using dplyr)?

Question

I have a data frame with two columns: the first holds the name of a fruit, the second the basket it's found in.

fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangos", "apple", "mangos", "banana"),
"basket" = c("one", "one", "two", "two", "three", "three", "four", "four"))

I'd like the end result to be a lower or upper triangular matrix where the baskets are the rows and columns, and the value between two baskets is the number of fruits they have in common. For example, baskets one and two share 1 common fruit, grapes, so there would be a 1; baskets one and three share 1 common fruit; and so forth for all possible basket combinations. If possible, I'd like the answer to use dplyr!

Thank you.

Answer 1

Here's a fairly compact solution. It requires magrittr for the compound assignment operator (%<>%) and dplyr for mutate. First, I create the data frame.

# Data frame
fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangoes", "apple", "mangoes", "banana"),
                           "basket" = c("one", "one", "two", "two", "three", "three", "four", "four"))

Next, I convert basket numbers from words to actual numbers for simplicity. (This is pretty kludgy. There must be a better way.)

# Load libraries
library(magrittr)
library(dplyr)

# Convert words to numbers -- there has to be a better way!!!
fruit_basket %<>%
  mutate(basket = case_when(
    basket == "one" ~ 1,
    basket == "two" ~ 2,
    basket == "three" ~ 3,
    basket == "four" ~ 4
  ))
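
As a possible alternative to the case_when() mapping above, here is a minimal sketch that uses base R's match() inside mutate(). The lookup vector basket_words and the result name fruit_basket_numbered are hypothetical names introduced for illustration; the sketch assumes every basket label appears in the lookup vector (anything missing would become NA).

```r
library(dplyr)

fruit_basket <- data.frame(
  fruit  = c("apple", "grapes", "banana", "grapes",
             "mangoes", "apple", "mangoes", "banana"),
  basket = c("one", "one", "two", "two", "three", "three", "four", "four")
)

# Hypothetical lookup vector: basket words in the order they should be numbered
basket_words <- c("one", "two", "three", "four")

# match() returns the position of each basket word in the lookup vector,
# so "one" becomes 1, "two" becomes 2, and so on
fruit_basket_numbered <- fruit_basket %>%
  mutate(basket = match(basket, basket_words))
```

The mapping then lives in one vector, so adding a fifth basket would only mean extending basket_words.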

Then, I do the actual calculation and remove the diagonal and lower triangle (thanks to @smci for the one-liner for the latter!):

# Build table then calculate cross product 
res <- crossprod(table(fruit_basket))

# Remove lower triangle & diagonals
res[lower.tri(res, diag = TRUE)] <- NA

which gives:

#         basket
# basket  1  2  3  4
#      1 NA  1  1  0
#      2 NA NA  0  1
#      3 NA NA NA  1
#      4 NA NA NA NA
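
To see why the cross product of the table yields the shared-fruit counts, the following sketch spells out the intermediate step. It keeps the basket words as labels by fixing the factor levels instead of converting to numbers; the names incidence and shared are introduced here for illustration. It assumes each fruit appears at most once per basket, since otherwise the entries would be products of counts, not counts of shared fruits.

```r
fruit_basket <- data.frame(
  fruit  = c("apple", "grapes", "banana", "grapes",
             "mangoes", "apple", "mangoes", "banana"),
  basket = c("one", "one", "two", "two", "three", "three", "four", "four")
)

# Fix the basket order explicitly; otherwise the levels are sorted alphabetically
fruit_basket$basket <- factor(fruit_basket$basket,
                              levels = c("one", "two", "three", "four"))

# Incidence table: one row per fruit, one column per basket,
# with a 1 where the fruit is in the basket and a 0 otherwise
incidence <- table(fruit_basket$fruit, fruit_basket$basket)

# crossprod(x) computes t(x) %*% x, so entry [i, j] adds up, over all fruits,
# (fruit in basket i) * (fruit in basket j): the number of shared fruits
shared <- crossprod(incidence)

shared["one", "two"]  # baskets one and two share grapes
```

The diagonal of shared holds the number of fruits in each basket, which is why the answer blanks it out together with the lower triangle.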

Answer 2

I imagine somebody more fluent in the tidyverse will come along and offer a more compact answer. But for now, here is a simple way of solving your problem that uses dplyr for some of it.

To start, notice that I added a column representing the basket numbers numerically; this just makes subsetting a little more convenient. Then I created a data frame of missing values with the dimensions of the desired output.

Next, I looped through the basket numbers and used dplyr::filter() and dplyr::pull() to get a vector of the fruits in each basket. In an inner loop, I got a vector of the fruits in each of the other baskets and counted how many fruits the two baskets share.

At the end of each outer iteration, I replaced the corresponding column in the empty data frame with the vector of shared-fruit counts for that basket number. Finally, I relabeled the columns to make the result a bit clearer.

library(dplyr)

fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangos", "apple", "mangos", "banana"),
                           "basket" = c("one", "one", "two", "two", "three", "three", "four", "four"),
                           stringsAsFactors = FALSE)


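# Numeric basket ids make subsetting more convenient
fruit_basket$basket_number <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2))

# Empty square data frame to hold the pairwise counts
n_baskets <- max(fruit_basket$basket_number)
output_df <- data.frame(matrix(NA, nrow = n_baskets, ncol = n_baskets))

for (i in 1:n_baskets) {

  # Fruits in the current basket
  fruits_in_current_basket <- fruit_basket %>%
    filter(basket_number == i) %>%
    pull(fruit)

  basket_count <- c()

  for (j in 1:n_baskets) {

    if (j == i) {
      # A basket shares all of its own fruits with itself (2 in this data)
      shared_fruits <- length(fruits_in_current_basket)
    } else {
      fruits_in_comparison_basket <- fruit_basket %>%
        filter(basket_number == j) %>%
        pull(fruit)

      shared_fruits <- sum(fruits_in_current_basket %in% fruits_in_comparison_basket)
    }

    basket_count <- c(basket_count, shared_fruits)
  }

  # Column i holds the counts shared between basket i and every basket
  output_df[, i] <- basket_count
}

colnames(output_df) <- c("basket_one", "basket_two", "basket_three", "basket_four")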
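
The nested loop above could probably be condensed. As a rough sketch (not part of the original answer), base R's split() and sapply() can build the same pairwise counts; the names fruits_by_basket and shared_counts are introduced here for illustration. Note that intersect() counts distinct shared fruits, which matches the loop only as long as no fruit is repeated within a basket.

```r
fruit_basket <- data.frame(
  fruit = c("apple", "grapes", "banana", "grapes",
            "mangos", "apple", "mangos", "banana"),
  basket_number = rep(1:4, each = 2),
  stringsAsFactors = FALSE
)

# One character vector of fruits per basket, named "1" to "4"
fruits_by_basket <- split(fruit_basket$fruit, fruit_basket$basket_number)

# For every pair of baskets, count the distinct fruits present in both;
# the nested sapply() calls simplify to a 4 x 4 matrix
shared_counts <- sapply(fruits_by_basket, function(a) {
  sapply(fruits_by_basket, function(b) length(intersect(a, b)))
})

shared_counts["1", "2"]  # baskets 1 and 2 share grapes
```

The result is a symmetric matrix, so either triangle can be blanked out afterwards if only one is wanted.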

Answer 1
2 Accepted 2018-11-08 18:09:29

Answer 2
1 2018-11-08 18:09:17
