How can I summarize this data with R?

Question

I am analyzing the flow of customers between different shopping venues. I have data like this:

df <- data.frame(customer.id=letters[seq(1,7)], 
                 shop.1=c(1,1,1,1,1,0,0),
                 shop.2=c(0,0,1,1,1,1,0),
                 shop.3=c(1,0,0,0,0,0,1))
df

#>   customer.id shop.1 shop.2 shop.3
#> 1           a      1      0      1
#> 2           b      1      0      0  
#> 3           c      1      1      0 
#> 4           d      1      1      0 
#> 5           e      1      1      0 
#> 6           f      0      1      0 
#> 7           g      0      0      1

So, for example:

customer "a" shopped at shops 1 & 3 only,
customer "b" shopped at shop 1 only,
customer "c" shopped at shops 1 & 2 only,
etc.

I want to summarize the data like so:

#>           shop.1 shop.2 shop.3 
#> shop.1         5      3      1
#> shop.2         3      4      0       
#> shop.3         1      0      2

So, for example, row 1 reads:

5 people shopped at both shop 1 and shop 1 (this is obviously a redundant observation)
3 people shopped at both shop 1 and shop 2
1 person shopped at both shop 1 and shop 3

How can I accomplish this (please note: I have many shops in my data set, so a scalable approach is preferred)?

Answer 1

crossprod() can do this once the data is reshaped into long form, with one column for the customer and one for the shop:

tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]

crossprod(table(tmp[c(1,3)]))

#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2
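
To see why this works: the table() call builds a customer-by-shop incidence table, and crossprod(X) computes t(X) %*% X, so each cell counts the customers who have a 1 in both shop columns. A minimal sketch on the question's data (the names incidence and co are only illustrative, not part of the original answer):

```r
# Data from the question
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Long form: one row per (customer, shop) pair, keeping only actual visits
tmp <- cbind(df[1], stack(df[-1]))
tmp <- tmp[tmp$values == 1, ]

# Customer-by-shop incidence table (rows: customers, columns: shops)
incidence <- table(tmp[c("customer.id", "ind")])

# crossprod(X) is t(X) %*% X: cell [i, j] counts customers seen in both shops
co <- crossprod(incidence)

# The diagonal is simply the number of customers per shop
diag(co)
```

Because the work is a single matrix product, the same code applies unchanged however many shop.* columns the data frame has.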

Answer 2

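You want to tabulate the co-occurrence of the shop.* variables.

If the shop columns are already numeric 0/1, as in the question, skip the next line (it would turn every value into 1). It is only needed when visits are stored as strings, with "" meaning no visit:

df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )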

1) It can supposedly be done with ftable(xtabs(...)), but I struggled with that for ages and could not get the pairwise table out of it. The closest I got is the joint count over all three shops:

> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))

              shop.3 0 1
shop.1 shop.2           
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0
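
For a single pair of shops, a two-way xtabs() does give the co-occurrence count directly. A small sketch, assuming the numeric 0/1 df from the question (the name pair is only illustrative):

```r
# Data from the question
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# 2 x 2 contingency table for one pair of shops
pair <- xtabs(~ shop.1 + shop.2, data = df)

# The cell where both variables equal 1 counts customers who visited both
pair["1", "1"]
```

Repeating this for every pair gets unwieldy with many shops, which is why the crossprod approach is the more scalable route.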

2) As @thelatemail showed, you could also:

# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
                 dplyr::filter(value==1)

   customer.id variable value
1            a   shop.1     1
2            b   shop.1     1
3            c   shop.1     1
4            d   shop.1     1
5            e   shop.1     1
6            c   shop.2     1
7            d   shop.2     1
8            e   shop.2     1
9            f   shop.2     1
10           a   shop.3     1
11           g   shop.3     1

The value column is no longer needed after the filter, and table() below expects only the two remaining columns, so drop it:

occurrence_df <- occurrence_df %>% dplyr::select(-value)

   customer.id variable
1            a   shop.1
2            b   shop.1
3            c   shop.1
4            d   shop.1
5            e   shop.1
6            c   shop.2
7            d   shop.2
8            e   shop.2
9            f   shop.2
10           a   shop.3
11           g   shop.3

# then same crossprod step as @thelatemail's answer:

crossprod(table(occurrence_df))

        variable
variable shop.1 shop.2 shop.3
  shop.1      5      3      1
  shop.2      3      4      0
  shop.3      1      0      2

(Footnotes:

First, your data should be numeric (or factor), not string: convert "x" to 1 and "" to 0.
If the columns are strings because they came from read.csv, use its stringsAsFactors=TRUE argument to read them as factors, or colClasses to control the column types; there are many duplicate questions covering that.)
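
The conversion mentioned in the footnote could look roughly like this. It is a sketch under the assumption that visits are marked with the string "x" and non-visits with an empty string; the data frame raw and the helper name shop_cols are made up for illustration:

```r
# Hypothetical string-coded input: "x" marks a visit, "" marks no visit
raw <- data.frame(customer.id = c("a", "b", "c"),
                  shop.1 = c("x", "x", ""),
                  shop.2 = c("", "x", "x"),
                  stringsAsFactors = FALSE)

# Pick the shop columns by name rather than by position
shop_cols <- grep("^shop\\.", names(raw), value = TRUE)

# Recode each shop column to integer 0/1
raw[shop_cols] <- lapply(raw[shop_cols], function(x) as.integer(x == "x"))
raw
```

Selecting the columns by a name pattern avoids hard-coding 2:4, which matters once there are many shops.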

Answer 3

In fact, a plain matrix operation is enough, because the data frame contains only 0s and 1s.

First, drop the customer.id column and convert the data frame to a matrix (mydf is the name of your data frame):

# base R way
as.matrix(mydf[,-1])
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

library(dplyr) #dplyr way
(mymat <-
  mydf %>% 
  select(-customer.id) %>% 
  as.matrix())
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

With this matrix, multiply its transpose by the matrix itself:

t(mymat) %*% mymat
#>        shop.1 shop.2 shop.3
#> shop.1      5      3      1
#> shop.2      3      4      0
#> shop.3      1      0      2

This is the co-occurrence table you asked for.
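
Base R's crossprod() computes the same product in one call. The sketch below assumes mydf holds the question's data; co and co_offdiag are illustrative names, and blanking the diagonal is an optional extra, not something the question requires:

```r
# Data from the question, under the name used in this answer
mydf <- data.frame(customer.id = letters[1:7],
                   shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                   shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                   shop.3 = c(1, 0, 0, 0, 0, 0, 1))
mymat <- as.matrix(mydf[, -1])

# crossprod(mymat) is equivalent to t(mymat) %*% mymat
co <- crossprod(mymat)
all.equal(co, t(mymat) %*% mymat)

# Optional: hide the redundant "shop with itself" counts on the diagonal
co_offdiag <- co
diag(co_offdiag) <- NA
co_offdiag
```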

3 answers

solution1
4 ACCEPTED 2019-03-05 03:37:42

solution2
3 2019-03-05 02:07:23

solution3
2 2019-03-05 04:18:21
