简体   繁体   中英

How can I summarize this data with R?

I am analyzing the flow of customers between different shopping venues. I have data like this:

df <- data.frame(customer.id=letters[seq(1,7)], 
                 shop.1=c(1,1,1,1,1,0,0),
                 shop.2=c(0,0,1,1,1,1,0),
                 shop.3=c(1,0,0,0,0,0,1))
df
#>   customer.id shop.1 shop.2 shop.3
#> 1           a      1      0      1
#> 2           b      1      0      0  
#> 3           c      1      1      0 
#> 4           d      1      1      0 
#> 5           e      1      1      0 
#> 6           f      0      1      0 
#> 7           g      0      0      1

So, for example:

  • customer "a" shopped at shops 1 & 3 only,

  • customer "b" shopped at shop 1 only,

  • customer "c" shopped at shops 1 & 2 only,

  • etc.

I want to summarize the data like so:

#>           shop.1 shop.2 shop.3 
#> shop.1         5      3      1
#> shop.2         3      4      0       
#> shop.3         1      0      2       

So, for example, row 1 reads:

  • 5 people shopped at both shop 1 and shop 1 (this is obviously a redundant observation)
  • 3 people shopped at both shop 1 and shop 2
  • 1 person shopped at both shop 1 and shop 3

How can I accomplish this (please note: I have many shops in my data set, so a scalable approach is preferred)?

crossprod can take care of what you want to do, after a bit of basic manipulation to get it into 2 columns representing customer and shop respectively:

tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]

crossprod(table(tmp[c(1,3)]))

#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2

You want to tabulate the co-occurrence of shop.* variables:

df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )

1) It can supposedly be done using ftable(xtabs(...)) , but I struggled with that for ages and couldn't get it. The closest I got is:

> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))

              shop.3 0 1
shop.1 shop.2           
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0

2) As @thelatemail showed, you could also:

# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
                 dplyr::filter(value==1)

   customer.id variable value
1            a   shop.1     1
2            b   shop.1     1
3            c   shop.1     1
4            d   shop.1     1
5            e   shop.1     1
6            c   shop.2     1
7            d   shop.2     1
8            e   shop.2     1
9            f   shop.2     1
10           a   shop.3     1
11           g   shop.3     1

Really we can drop value column after the filter, so we could pipe %>% select(-value)

   customer.id variable
1            a   shop.1
2            b   shop.1
3            c   shop.1
4            d   shop.1
5            e   shop.1
6            c   shop.2
7            d   shop.2
8            e   shop.2
9            f   shop.2
10           a   shop.3
11           g   shop.3

# then same crossprod step as @thelatemail's answer:

crossprod(table(occurrence_df))

        variable
variable shop.1 shop.2 shop.3
  shop.1      5      3      1
  shop.2      3      4      0
  shop.3      1      0      2

(Footnotes:

  • First your data should be numeric (or factor), not string. You want to convert "x" to 1 and "" to 0.
  • If they are strings because they came from read.csv , use read.csv arguments stringsAsFactors=TRUE to make them factor, or colClasses to make them numeric, and see all the many duplicate questions on that.)

In fact, matrix operation seems enough because the data frame only has 0 and 1 .

First, exclude the customer.id column and change the data.frame to matrix . This might be easy. ( mydf is the name of your data frame.)

# base R way
as.matrix(mydf[,-1])
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

library(dplyr) #dplyr way
(mymat <-
  mydf %>% 
  select(-customer.id) %>% 
  as.matrix())
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

With this matrix, just do the matrix operation as below.

t(mymat) %*% mymat
#>        shop.1 shop.2 shop.3
#> shop.1      5      3      1
#> shop.2      3      4      0
#> shop.3      1      0      2

You can get your answer.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM