How to count all common occurrences of one factor between pairwise combinations of another factor's levels (preferably using dplyr)?

I have a data frame with two columns: the first gives the name of a fruit, the second the basket it is found in.

fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangos", "apple", "mangos", "banana"),
                           "basket" = c("one", "one", "two", "two", "three", "three", "four", "four"))

I'd like the end result to be a lower or upper triangular matrix whose rows and columns are the baskets, and where the value for each pair of baskets is the number of fruits they have in common. For example, baskets one and two share one fruit (grapes), so that entry would be 1; baskets one and three also share one fruit; and so forth for all possible basket combinations. If possible, I'd like the answer to use dplyr!

Thank you.

Here's a fairly compact solution. It requires magrittr for the compound assignment operator (%<>%) and dplyr for mutate() and case_when(). First, I create the data frame.

# Data frame
fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangoes", "apple", "mangoes", "banana"),
                           "basket" = c("one", "one", "two", "two", "three", "three", "four", "four"))

Next, I convert the basket labels from words to actual numbers for simplicity. (This is pretty kludgy. There must be a better way.)

# Load libraries
library(magrittr)
library(dplyr)

# Convert words to numbers -- there has to be a better way!!!
fruit_basket %<>%
  mutate(basket = case_when(
    basket == "one" ~ 1,
    basket == "two" ~ 2,
    basket == "three" ~ 3,
    basket == "four" ~ 4
  ))
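
One possible alternative to the case_when() conversion, sketched in base R: turn the basket column into a factor with explicitly ordered levels and take its integer codes. The names basket_levels, fruit_basket_alt and basket_numbers are made up for this illustration, and the sketch assumes the labels are exactly these four words.

```r
# Same data as above, kept under a separate (hypothetical) name
fruit_basket_alt <- data.frame(
  fruit  = c("apple", "grapes", "banana", "grapes", "mangoes", "apple", "mangoes", "banana"),
  basket = c("one", "one", "two", "two", "three", "three", "four", "four"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup vector: the position of each word is its numeric value
basket_levels <- c("one", "two", "three", "four")

# A factor with explicit levels keeps the words and fixes their order
fruit_basket_alt$basket <- factor(fruit_basket_alt$basket, levels = basket_levels)

# The integer codes of the factor are the basket numbers
basket_numbers <- as.integer(fruit_basket_alt$basket)
```

Keeping basket as a factor with explicit levels should also let table() label the rows and columns with the original words in the intended order, rather than alphabetically.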

Then I do the actual calculation and remove the diagonal and lower triangle (thanks to @smci for the one-liner for the latter!):

# Build the fruit-by-basket table, then calculate its cross product
res <- crossprod(table(fruit_basket))

# Remove the lower triangle and the diagonal
res[lower.tri(res, diag = TRUE)] <- NA

which gives:

#         basket
# basket  1  2  3  4
#      1 NA  1  1  0
#      2 NA NA  0  1
#      3 NA NA NA  1
#      4 NA NA NA NA
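
To see why the cross product yields the shared-fruit counts, here is a small sketch that exposes the intermediate table and checks one entry by hand. The names incidence, shared and direct_1_2 are made up for this illustration, and the reasoning assumes each fruit appears at most once per basket (otherwise the products would overcount).

```r
fruit_basket <- data.frame(
  fruit  = c("apple", "grapes", "banana", "grapes", "mangoes", "apple", "mangoes", "banana"),
  basket = c(1, 1, 2, 2, 3, 3, 4, 4),
  stringsAsFactors = FALSE
)

# Rows are fruits, columns are baskets; a 1 means the fruit is in that basket
incidence <- table(fruit_basket)

# crossprod(M) is t(M) %*% M: entry [i, j] sums, over all fruits, the product
# of the indicators for baskets i and j, i.e. the number of fruits in both
shared <- crossprod(incidence)

# The same count for baskets 1 and 2, computed directly as a set intersection
direct_1_2 <- length(intersect(
  fruit_basket$fruit[fruit_basket$basket == 1],
  fruit_basket$fruit[fruit_basket$basket == 2]
))
```

Before the lower triangle is blanked out, the diagonal of shared holds the number of fruits in each basket, which is why it is removed along with the redundant lower half.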

I imagine somebody more fluent in the tidyverse will come along and offer a more compact answer. But for now, here is a simple way of solving your problem that uses dplyr for part of it.

To start, notice that I added a column holding the basket numbers as numeric values; this just makes subsetting a little more convenient. Then I created a data frame of missing values with the dimensions of the desired output.

Next, I looped through the basket numbers and used dplyr::filter() and dplyr::pull() to get a vector of the fruits in each basket. Inside that loop I ran a second loop, where I got a vector of the fruits in each of the other baskets and counted how many fruits the two baskets share.

At the end of each pass through the outer loop, I replaced the corresponding column of the empty data frame with the vector of shared-fruit counts for that basket number. Finally, I relabeled the columns to make the result a bit clearer.

library(dplyr)

fruit_basket <- data.frame("fruit" = c("apple", "grapes", "banana", "grapes", "mangos", "apple", "mangos", "banana"),
                           "basket" = c("one", "one", "two", "two", "three", "three", "four", "four"),
                           stringsAsFactors = FALSE)

# Numeric basket index (1 = "one", ..., 4 = "four") for easier subsetting
fruit_basket$basket_number <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2))

n_baskets <- max(fruit_basket$basket_number)

# All-NA data frame with one row and one column per basket
output_df <- data.frame(matrix(NA, nrow = n_baskets, ncol = n_baskets))

for (i in seq_len(n_baskets)) {

  fruits_in_current_basket <- fruit_basket %>%
    filter(basket_number == i) %>%
    pull(fruit)

  basket_count <- c()

  for (j in seq_len(n_baskets)) {

    if (j == i) {
      # A basket shares all of its own fruits with itself
      shared_fruits <- length(fruits_in_current_basket)
    } else {
      fruits_in_comparison_basket <- fruit_basket %>%
        filter(basket_number == j) %>%
        pull(fruit)

      shared_fruits <- sum(fruits_in_current_basket %in% fruits_in_comparison_basket)
    }

    basket_count <- c(basket_count, shared_fruits)
  }

  # Column i holds the counts shared between basket i and every basket
  output_df[, i] <- basket_count
}

colnames(output_df) <- c("basket_one", "basket_two", "basket_three", "basket_four")
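
For comparison, the same pairwise counts can probably be obtained without explicit loops. The following is a base R sketch, not dplyr, that swaps the two for loops for split() and nested sapply(); the names fruits_by_basket and shared_counts are made up for this illustration, and it assumes each fruit appears at most once per basket.

```r
fruit_basket <- data.frame(
  fruit = c("apple", "grapes", "banana", "grapes", "mangos", "apple", "mangos", "banana"),
  basket_number = c(1, 1, 2, 2, 3, 3, 4, 4),
  stringsAsFactors = FALSE
)

# One character vector of fruits per basket, named "1" to "4"
fruits_by_basket <- split(fruit_basket$fruit, fruit_basket$basket_number)

# For every pair of baskets, count the fruits that appear in both
shared_counts <- sapply(fruits_by_basket, function(a) {
  sapply(fruits_by_basket, function(b) length(intersect(a, b)))
})
```

The result is a symmetric matrix rather than a data frame; its diagonal is the size of each basket, so it can be blanked out with lower.tri() or upper.tri() if only one triangle is wanted.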
