I have a data frame of products (apple, pear, banana) sold across different locations (cities) within different categories (food and edibles).
I would like to count how many times any given pair of products appeared together in any category.
This is an example dataset I'm trying to get this to work on:
category <- c('food','food','food','food','food','food','edibles','edibles','edibles','edibles', 'edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV', 'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX', 'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear', 'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(cbind(category, location, item), stringsAsFactors = FALSE)
For example, the pair "apple & banana" appeared together in the "food" category in "houston, TX", but also in the "edibles" category in "charlotte, NC". Therefore, the count for the "apple & banana" pair would be 2.
My desired output is a count of pairs like this:
(unordered) count of apple & banana
2
(unordered) count of apple & pear
4
Anyone have an idea for how to accomplish this? Relatively new to R and have been confused for a while.
I'm trying to use this to calculate affinities between different items.
Additional clarification on output: My full dataset consists of hundreds of different items. Would like to get a data frame where the first column is the pair and the second column is the count for each pair.
Here is one way using tidyverse and crossprod. spread turns all items from the same category-location combination into one row, with the items as column headers and the values indicating presence (this requires that no item is duplicated within a category-location combination; otherwise you need a pre-aggregation step). crossprod then computes the inner product of each pair of item columns, which gives the number of co-occurrences.
library(tidyverse)
food_data %>%
mutate(n = 1) %>%
spread(item, n, fill=0) %>%
select(-category, -location) %>%
{crossprod(as.matrix(.))} %>%
`diag<-`(0)
#        apple banana pear
# apple      0      2    4
# banana     2      0    1
# pear       4      1    0
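To see what crossprod is computing, the same matrix can be sketched in base R. This is only an illustrative alternative, not the code from the answer above: it assumes the example data from the question, and the names group, indicator and cooc are made up for the sketch.

```r
# Example data from the question
category <- c('food','food','food','food','food','food',
              'edibles','edibles','edibles','edibles','edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV',
              'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX',
              'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear',
          'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# One row per category/location combination, one column per item
group <- paste(food_data$category, food_data$location, sep = " / ")
indicator <- unclass(table(group, food_data$item))

# Cap counts at 1 so an item duplicated within a group is counted once
indicator[indicator > 0] <- 1

# Entry [i, j] is the sum over groups of indicator[, i] * indicator[, j],
# i.e. the number of groups that contain both item i and item j
cooc <- crossprod(indicator)
diag(cooc) <- 0  # the diagonal only pairs each item with itself

cooc
```

Capping the indicator at 1 plays the role of the pre-aggregation step mentioned above, so duplicated rows would not inflate the counts.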
To convert this to a data frame:
food_data %>%
mutate(n = 1) %>%
spread(item, n, fill=0) %>%
select(-category, -location) %>%
{crossprod(as.matrix(.))} %>%
replace(lower.tri(., diag=T), NA) %>%
reshape2::melt(na.rm=T) %>%
unite('Pair', c('Var1', 'Var2'), sep=", ")
#            Pair value
#4 apple, banana     2
#7   apple, pear     4
#8  banana, pear     1
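If you would rather avoid the reshape2 dependency, the upper triangle can also be pulled out in base R. The sketch below is an assumption-laden alternative, not part of the original answer: it hard-codes the 3 x 3 matrix printed earlier, and idx and pair_counts are hypothetical names.

```r
# Co-occurrence matrix as printed above
items <- c("apple", "banana", "pear")
cooc <- matrix(c(0, 2, 4,
                 2, 0, 1,
                 4, 1, 0),
               nrow = 3, byrow = TRUE, dimnames = list(items, items))

# Row/column positions of the cells strictly above the diagonal,
# so each unordered pair is listed exactly once
idx <- which(upper.tri(cooc), arr.ind = TRUE)

pair_counts <- data.frame(
  Pair  = paste(rownames(cooc)[idx[, "row"]],
                colnames(cooc)[idx[, "col"]], sep = ", "),
  value = cooc[idx],  # a two-column index matrix picks out individual cells
  stringsAsFactors = FALSE
)

pair_counts
```

With hundreds of items the matrix has one cell per pair, so this stays a single vectorised lookup rather than a loop over pairs.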
A solution from the tidyverse. The idea is to create food_data2, which is the wide format of food_data. After that, create every combination of two unique items and use map2_int to loop through each combination and count how many rows contain both items. This solution should work for any number of items.
library(tidyverse)
food_data2 <- food_data %>%
mutate(count = 1) %>%
spread(item, count, fill = 0)
food_combination <- food_data %>%
pull(item) %>%
unique() %>%
combn(2) %>%
t() %>%
as_data_frame() %>%
mutate(count = map2_int(V1, V2,
~sum(apply(food_data2 %>% select(.x, .y), 1, sum) == 2)))
# View the result
food_combination
# A tibble: 3 x 3
#   V1     V2     count
#   <chr>  <chr>  <int>
# 1 apple  banana     2
# 2 apple  pear       4
# 3 banana pear       1
If you just want one column to show the item combination at the end, you can further use the unite function.
food_combination2 <- food_combination %>%
unite(Pair, V1, V2)
# View the result
food_combination2
# A tibble: 3 x 2
#   Pair         count
# * <chr>        <int>
# 1 apple_banana     2
# 2 apple_pear       4
# 3 banana_pear      1
Here is a little function that will do what you need. It could be generalized to arbitrary grouping columns with the dplyr evaluation system described here. Probably better ways to do it but this works :p
Comments/explanations are inline ~~
library("dplyr")
# a function to apply to `food_data` from the original post
count_combos <- function(df, group_col1, group_col2, count_col){
# use `combn()` to get all the unique pairs from the `count_col` column
combos <- t(combn(sort(unique(df[[count_col]])), 2)) %>%
as_data_frame() %>%
# initialize an empty column to catch the counts
mutate(count=NA)
# create a new df from the colnames passed as args,
# (it would be more general to just use the dplyr evaluation system (@_@))
df <- data_frame(
group_col1 = df[[group_col1]],
group_col2 = df[[group_col2]],
count_col = df[[count_col]]
)
# for each combo of the grouping vars, get a pipe-separated string of items
df <- df %>%
group_by(group_col1, group_col2) %>% summarize(
items = paste(unique(count_col), collapse="|")
) %>% ungroup()
# for each item pair/combo, get the number of rows of `df` with both items
# (note: grepl() matches substrings, so an item such as 'apple' would also match 'pineapple')
combos$count <- sapply(1:nrow(combos), function(x){
sum(grepl(combos$V1[x], df$items) & grepl(combos$V2[x], df$items))
})
# and return it in a nice df
return(combos)
}
# apply the function
count_combos(food_data,
group_col1="category", group_col2="location", count_col="item")
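One thing to watch with the grepl() approach is that it matches substrings, which may matter with hundreds of item names. A rough base-R sketch that compares whole item names instead is shown below. It is not the function above, just an alternative under the assumption that the data looks like the example in the question; items_by_group, pairs_by_group and combo_counts are made-up names.

```r
# Example data from the question
category <- c('food','food','food','food','food','food',
              'edibles','edibles','edibles','edibles','edibles')
location <- c('houston, TX', 'houston, TX', 'las vegas, NV', 'las vegas, NV',
              'philadelphia, PA', 'philadelphia, PA', 'austin, TX', 'austin, TX',
              'charlotte, NC', 'charlotte, NC', 'charlotte, NC')
item <- c('apple', 'banana', 'apple', 'pear', 'apple', 'pear',
          'pear', 'apple', 'apple', 'pear', 'banana')
food_data <- data.frame(category, location, item, stringsAsFactors = FALSE)

# Items of each category/location combination (unused combinations dropped)
items_by_group <- split(food_data$item,
                        list(food_data$category, food_data$location),
                        drop = TRUE)

# Within each group, list every unordered pair of distinct items once;
# sorting first makes "apple & pear" and "pear & apple" the same label
pairs_by_group <- lapply(items_by_group, function(items) {
  items <- sort(unique(items))
  if (length(items) < 2) return(character(0))
  combn(items, 2, FUN = paste, collapse = " & ")
})

# Tally how many groups each pair appears in
counts <- table(unlist(pairs_by_group, use.names = FALSE))
combo_counts <- data.frame(pair  = names(counts),
                           count = as.vector(counts),
                           stringsAsFactors = FALSE)

combo_counts
```

Pairs that never occur together are simply absent from the result instead of showing up with a count of 0.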