How do I create a dataframe with multiple categorical variables and interaction effects, grouped by ID?

Question

I want to reshape my dataframe so that it has one row per ID, with an indicator column for each level of my categorical variables and for each interaction between them.

This is what the original table looks like.

+----+----------------+---------+
| ID |      Page      |  Click  |
+----+----------------+---------+
|  1 | homepage       | logo    |
|  1 | homepage       | search  |
|  1 | category page  | logo    |
|  1 | category page  | search  |
|  2 | homepage       | logo    |
|  2 | homepage       | search  |
| .. |                |         | 
+----+----------------+---------+

I would like to make it into a table like this.

+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
| ID | Page_homepage  | Page_categorypage  | Click_logo | Click_search  | homepage:search | categorypage:search  | homepage:logo | categorypage:logo |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
|  1 |              1 |                  1 |          1 |             1 |               1 |                    1 |             1 |                 1 |
|  2 |              1 |                  0 |          1 |             1 |               1 |                    0 |             1 |                 0 |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+

My objective is to be able to create features with interaction effects to perform a logistic regression. There are outputs associated with each ID, so it's important for me to group the results by ID.

What is the simplest way to do this? I don't want to build every possible combination by hand. R, Python, or SQL would all be fine.

Answer 1

One way to go about this is to do the individual variables and the interactions separately, then join them together:

library(tidyverse)
# Example data from the question (spaces removed from the Page values)
tbl <- tibble(
  ID = c(1, 1, 1, 1, 2, 2),
  Page = c("homepage", "homepage", "categorypage", "categorypage", "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search")
)
tbl
#> # A tibble: 6 x 3
#>      ID Page         Click 
#>   <dbl> <chr>        <chr> 
#> 1     1 homepage     logo  
#> 2     1 homepage     search
#> 3     1 categorypage logo  
#> 4     1 categorypage search
#> 5     2 homepage     logo  
#> 6     2 homepage     search

tbl %>%
  gather(variable, value, Page, Click) %>%
  transmute(ID, colname = str_c(variable, "_", value), presence = 1) %>%
  distinct() %>% # Individual variables now done, now add interactions
  bind_rows(transmute(tbl, ID, colname = str_c(Page, ":", Click), presence = 1)) %>%
  spread(colname, presence, fill = 0) %>%
  select(ID, matches("Page_"), matches("Click_"), matches(":"))
#> # A tibble: 2 x 9
#>      ID Page_categorypa… Page_homepage Click_logo Click_search
#>   <dbl>            <dbl>         <dbl>      <dbl>        <dbl>
#> 1     1                1             1          1            1
#> 2     2                0             1          1            1
#> # … with 4 more variables: `categorypage:logo` <dbl>,
#> #   `categorypage:search` <dbl>, `homepage:logo` <dbl>,
#> #   `homepage:search` <dbl>

Created on 2019-05-22 by the reprex package (v0.2.1)
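
Since the question is also open to Python, the same two-step idea (per-level indicators plus Page:Click interaction indicators, then one row per ID) can be sketched in plain Python with no third-party packages. This is only an illustration under stated assumptions: the function name indicator_table is hypothetical, the data are the six rows from the question, and only interactions that actually occur in the data become columns, as in the R output above.

```python
from itertools import chain

# Toy data copied from the question: (ID, Page, Click)
rows = [
    (1, "homepage", "logo"),
    (1, "homepage", "search"),
    (1, "categorypage", "logo"),
    (1, "categorypage", "search"),
    (2, "homepage", "logo"),
    (2, "homepage", "search"),
]


def indicator_table(rows):
    """Return one dict per ID with 0/1 columns for levels and Page:Click pairs.

    Hypothetical helper for illustration; not part of the original answer.
    """
    seen = {}  # ID -> set of column names observed for that ID
    for id_, page, click in rows:
        seen.setdefault(id_, set()).update(
            {f"Page_{page}", f"Click_{click}", f"{page}:{click}"}
        )
    # Union of every observed column name, in a stable (sorted) order
    columns = sorted(set(chain.from_iterable(seen.values())))
    return [
        {"ID": id_, **{col: int(col in seen[id_]) for col in columns}}
        for id_ in sorted(seen)
    ]


table = indicator_table(rows)
for row in table:
    print(row)
```

Each element of table is a dict keyed by column name, so it can be handed to whatever tabular structure you use for the regression.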

Answer 2

Here is another approach. I tried to make it work with as few assumptions as possible about the table's column names and size. The only assumptions are that the table has an ID column (named id and placed first in the code below) and that the remaining columns are of type character, as in your example.


library(dplyr)
library(purrr)

df <- data.frame( id = c(1,1,2,2,2,3,3), page = c("home", "home", "home", "cat", "cat", "cat", "hat"), 
                  click = c("search", "logo", "search", "logo", "search", "banana", "banana") )

# auxiliary function for reshape
indicate <- function(x) {
  as.integer(!is_empty(x))
}

# column list for which we want to create the table
cols <- df %>% select(-id) %>% colnames()

# changing variable levels names
purrr::map(cols, function(colname) {
  df %>% pull(colname) %>% gsub("^", paste0(colname, "_"), .)
}) %>% bind_cols() %>% setNames(cols) %>% bind_cols(df %>% select(id), .) -> df2

# creating indicator column for each variable level
purrr::map(cols, function(colname) {
  form.string <- paste("id ~", colname)
  reshape2::dcast(df2, as.formula(form.string), indicate)
}) %>% bind_cols() %>% 
  select(-matches("id\\d+")) -> result

# creating formula for all interactions between variables and joining with the rest of analysis
formula <- paste0("id ~ ", paste(cols, collapse = "+")) %>% as.formula()
df %>% reshape2::dcast(., formula, indicate) %>%
  left_join(., result) -> final_results

print(final_results)
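
For comparison, the same "few assumptions" idea can be sketched in plain Python: the first column is treated as the ID and every remaining column as categorical, whatever its name. This is a rough sketch, not a translation of the R code above. The function name wide_indicators is hypothetical, the interaction columns are the observed joint combinations of all non-ID columns joined with ":", and the data are the seven rows of the example df.

```python
def wide_indicators(header, records):
    """Build one 0/1 row per ID from records whose first field is the ID.

    Hypothetical helper for illustration. Every other field is treated as a
    categorical value; the interaction column is the joint combination of all
    of them, joined with ":".
    """
    id_col, *cat_cols = header
    seen = {}  # ID -> set of column names observed for that ID
    for id_, *values in records:
        values = [str(v) for v in values]
        names = {f"{col}_{val}" for col, val in zip(cat_cols, values)}
        names.add(":".join(values))  # joint combination of all categoricals
        seen.setdefault(id_, set()).update(names)
    # Union of every observed column name, in a stable (sorted) order
    columns = sorted(set().union(*seen.values()))
    return [
        {id_col: id_, **{name: int(name in seen[id_]) for name in columns}}
        for id_ in sorted(seen)
    ]


header = ["id", "page", "click"]
records = [
    (1, "home", "search"),
    (1, "home", "logo"),
    (2, "home", "search"),
    (2, "cat", "logo"),
    (2, "cat", "search"),
    (3, "cat", "banana"),
    (3, "hat", "banana"),
]

result = wide_indicators(header, records)
for row in result:
    print(row)
```

Because the column names are derived from header, adding another categorical column to the input requires no change to the function.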

2 answers

solution1
1 2019-05-22 20:20:56

solution2
1 2019-05-22 21:06:39