I am analyzing the flow of customers between different shopping venues. I have data like this:
df <- data.frame(customer.id=letters[seq(1,7)],
shop.1=c(1,1,1,1,1,0,0),
shop.2=c(0,0,1,1,1,1,0),
shop.3=c(1,0,0,0,0,0,1))
df
#> customer.id shop.1 shop.2 shop.3
#> 1 a 1 0 1
#> 2 b 1 0 0
#> 3 c 1 1 0
#> 4 d 1 1 0
#> 5 e 1 1 0
#> 6 f 0 1 0
#> 7 g 0 0 1
So, for example:
customer "a" shopped at shops 1 & 3 only,
customer "b" shopped at shop 1 only,
customer "c" shopped at shops 1 & 2 only,
I want to summarize the data like so:
#> shop.1 shop.2 shop.3
#> shop.1 5 3 1
#> shop.2 3 4 0
#> shop.3 1 0 2
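So, for example, row 1 reads: 5 customers shopped at shop 1; 3 of them also shopped at shop 2, and 1 of them also shopped at shop 3.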
How can I accomplish this (please note: I have many shops in my data set, so a scalable approach is preferred)?
crossprod can take care of what you want to do, after a bit of basic manipulation to get the data into 2 columns representing customer and shop respectively:
tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]
crossprod(table(tmp[c(1,3)]))
#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2
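To see why crossprod gives the co-occurrence counts, it may help to look at the intermediate table. This is only a sketch on the example data from the question; the names inc and co are mine, not part of the answer above. The table is a customer-by-shop incidence matrix, and crossprod(inc) is t(inc) %*% inc, so entry [i, j] counts the customers who appear in both shop i and shop j.

```r
# Rebuild the example data so this block runs on its own.
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Long form: one row per (customer, shop) pair, keeping only actual visits.
tmp <- cbind(df[1], stack(df[-1]))
tmp <- tmp[tmp$values == 1, ]

# Customer-by-shop incidence table: 1 where the customer visited the shop.
inc <- table(tmp[c(1, 3)])

# Shop-by-shop counts: the diagonal is customers per shop,
# the off-diagonal entries are customers shared by a pair of shops.
co <- crossprod(inc)
co
```

Because the result only depends on the number of shop columns, the same three steps should carry over unchanged to a data set with many shops.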
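You want to tabulate the co-occurrence of the shop.* variables. If those columns were read in as strings (with an empty string meaning no visit), first recode them to 0/1; skip this step when they are already numeric 0/1 as in the df above, because it would turn every value into 1:
df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )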
1) It can supposedly be done using ftable(xtabs(...)), but I struggled with that for ages and couldn't get it. The closest I got is:
> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))
              shop.3 0 1
shop.1 shop.2
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0
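One way to connect xtabs to the matrix you asked for, as a rough sketch rather than a finished solution: each off-diagonal entry is the "1","1" cell of a two-way xtabs on a pair of shops. The names two_way, both_1_2, shops and co below are just for illustration, and the nested sapply is my own loop over pairs, not something ftable does for you.

```r
# Rebuild the example data so this block runs on its own.
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Two-way table for a single pair of shops.
two_way <- xtabs(~ shop.1 + shop.2, data = df)

# The "1","1" cell counts customers who visited both shops.
both_1_2 <- two_way["1", "1"]

# Same count for every pair of shops, assembled into a matrix.
shops <- names(df)[-1]
co <- sapply(shops, function(a) {
  sapply(shops, function(b) sum(df[[a]] == 1 & df[[b]] == 1))
})
co
```

This loops over every pair of shops, so with many shops the crossprod route in 2) is likely the better fit.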
2) As @thelatemail showed, you could also:
# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
dplyr::filter(value==1)
customer.id variable value
1 a shop.1 1
2 b shop.1 1
3 c shop.1 1
4 d shop.1 1
5 e shop.1 1
6 c shop.2 1
7 d shop.2 1
8 e shop.2 1
9 f shop.2 1
10 a shop.3 1
11 g shop.3 1
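Really, we can drop the value column after the filter, so we could pipe on %>% select(-value). The crossprod step further down assumes this has been done, so that occurrence_df has only these two columns: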
customer.id variable
1 a shop.1
2 b shop.1
3 c shop.1
4 d shop.1
5 e shop.1
6 c shop.2
7 d shop.2
8 e shop.2
9 f shop.2
10 a shop.3
11 g shop.3
# then same crossprod step as @thelatemail's answer:
crossprod(table(occurrence_df))
variable
variable shop.1 shop.2 shop.3
shop.1 5 3 1
shop.2 3 4 0
shop.3 1 0 2
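A quick base-R sketch of why the value column has to go before calling table(). The long data frame is built by hand here (long and pairs are illustrative names) so the block does not depend on reshape2 or dplyr. With three columns, table() returns a three-dimensional array, whereas crossprod is documented for matrices and vectors, so the two-column version is the safer input.

```r
# Rebuild the example data so this block runs on its own.
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Hand-made long form with the same column names that melt() produces.
long <- data.frame(customer.id = rep(df$customer.id, times = 3),
                   variable    = rep(names(df)[-1], each = nrow(df)),
                   value       = unlist(df[-1], use.names = FALSE))
long <- long[long$value == 1, ]

# With 'value' still present, table() has three dimensions.
dim(table(long))

# Keep only customer and shop, so table() is a customer-by-shop matrix.
pairs <- long[c("customer.id", "variable")]
dim(table(pairs))

crossprod(table(pairs))
```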
(Footnote: when reading the data with read.csv, use the read.csv arguments stringsAsFactors=TRUE to make the shop.* columns factors, or colClasses to make them numeric, and see all the many duplicate questions on that.)

In fact, a matrix operation is enough, because the data frame only contains 0 and 1.
First, exclude the customer.id column and convert the data.frame to a matrix. This is easy. (mydf is the name of your data frame.)
# base R way
as.matrix(mydf[,-1])
#> shop.1 shop.2 shop.3
#> [1,] 1 0 1
#> [2,] 1 0 0
#> [3,] 1 1 0
#> [4,] 1 1 0
#> [5,] 1 1 0
#> [6,] 0 1 0
#> [7,] 0 0 1
# dplyr way
library(dplyr)
(mymat <-
mydf %>%
select(-customer.id) %>%
as.matrix())
#> shop.1 shop.2 shop.3
#> [1,] 1 0 1
#> [2,] 1 0 0
#> [3,] 1 1 0
#> [4,] 1 1 0
#> [5,] 1 1 0
#> [6,] 0 1 0
#> [7,] 0 0 1
With this matrix, just do the matrix operation as below.
t(mymat) %*% mymat
#> shop.1 shop.2 shop.3
#> shop.1 5 3 1
#> shop.2 3 4 0
#> shop.3 1 0 2
This gives the shop-by-shop matrix you asked for.
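As a small addition, crossprod(mymat) computes the same product as t(mymat) %*% mymat without writing out the transpose. The sketch below rebuilds mydf from the question's data (the name co is mine) and also shows how to read the result.

```r
# Rebuild the example data so this block runs on its own.
mydf <- data.frame(customer.id = letters[1:7],
                   shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                   shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                   shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Drop the id column and convert to a numeric matrix.
mymat <- as.matrix(mydf[, -1])

# crossprod(x) is the same product as t(x) %*% x.
co <- crossprod(mymat)
all.equal(co, t(mymat) %*% mymat)

# Diagonal: number of customers per shop.
diag(co)

# Off-diagonal: number of customers shared by each pair of shops.
co["shop.1", "shop.2"]
```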