简体   繁体   中英

Conditionally pasting values from one column to another in R

I'm struggling with some data wrangling stuff which I feel should be easy to solve (with lapply or something) but I just can't get it to work (and I'm evidently rusty as hell with this stuff). I'm working with a web-dataset structured as follows:

 df <- data.frame("ID" = c(1, 1, 1, 2, 3, 3),  
                 "URL_visit" = c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited 3 pages
                 "URL_name" = c("home", "login", "product_page", "home", "home", "product_page"),
                 "duration" = c(14, 40, 233, 8, 76, 561),
                 "home" = c(1, 0, 0, 1, 1, 0),
                 "login" = c(0, 1, 0, 0, 0, 0), 
                 "product_page" = c(0, 0, 1, 0, 0, 1)
                   )

So basically a customer ID field, a number for each event per customer, the URL they visited in that event, followed by a column for each URL with an indicator (1/0) whether the customer visited that particular URL in that event.

My goal is to have a piece of code that replaces the 1/0 indicators with the duration if a match (1) is found, and retains 0 if not. In other words:

  • (1) Row-wise checks the values of the URL-columns (cols 5:7 aka home, login, product_page in my example. I also have a vector with the column names) to be a 1 or 0
  • (2) If it finds a 1, inserts the value from the duration column, otherwise 0.

Either replacing all the current (0/1) values in the URL-columns with duration where applicable, or making a new set of columns (eg, "home_duration") works for me.

A manual example solution is:

df %<>% dplyr::mutate(home_duration = if_else(home == 1, duration, 0))

But of course my aim is to automate this, and conduct it for the whole set of URL-columns (passing a vector with URL column names).

Help is much appreciated: Thanks! :)

You can try pivoting it to long, doing your transformation, and then pivoting it back wide again.

library(dplyr)
library(tidyr)

url_col_names <- c("home", "login", "product_page")

df %>% 
  pivot_longer(url_col_names, names_to = "url", values_to = "url_duration") %>% 
  mutate(url_duration = url_duration * duration) %>% 
  pivot_wider(names_from = "url", values_from = "url_duration")
# A tibble: 6 x 7
     ID URL_visit URL_name     duration  home login product_page
  <dbl>     <dbl> <fct>           <dbl> <dbl> <dbl>        <dbl>
1     1         1 home               14    14     0            0
2     1         2 login              40     0    40            0
3     1         3 product_page      233     0     0          233
4     2         1 home                8     8     0            0
5     3         1 home               76    76     0            0
6     3         2 product_page      561     0     0          561

Another way, probably more simple, is to do this.

df %>% 
  mutate(across(any_of(url_col_names), ~ . * duration))
  ID URL_visit     URL_name duration home login product_page
1  1         1         home       14   14     0            0
2  1         2        login       40    0    40            0
3  1         3 product_page      233    0     0          233
4  2         1         home        8    8     0            0
5  3         1         home       76   76     0            0
6  3         2 product_page      561    0     0          561

Edit

On another note, I imagine you created those indicator variables? If you are just hoping to replace them, then you actually might not need to create them to begin with. You can just pivot_wider() from the start.

This would assume that your ID and URL_visit columns form a unique row combination.

df2 <- df[, 1:4]

df2 %>% 
  pivot_wider(names_from = "URL_name", values_from = "duration", values_fill = 0)

A simple multiplication should do the trick (this is equivalent to @Adam 's tidyverse solution above but in base R)

url_col_names <- c('home','login','product_page')    
df$duration * df[,url_col_names] -> df[,url_col_names]

To rename the columns, you can do:

names(df)[names(df) %in% url_col_names] <- paste0(url_col_names, '_', 'duration')
          

Similar to @Adam, across() can be used with ifelse() in order to compute variables using a similar structure as the user mentioned:

library(dplyr)
#Data
df <- data.frame("ID" = c(1, 1, 1, 2, 3, 3),  
                 "URL_visit" = c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited 3 pages
                 "URL_name" = c("home", "login", "product_page", "home", "home", "product_page"),
                 "duration" = c(14, 40, 233, 8, 76, 561),
                 "home" = c(1, 0, 0, 1, 1, 0),
                 "login" = c(0, 1, 0, 0, 0, 0), 
                 "product_page" = c(0, 0, 1, 0, 0, 1)
)
#Code
df %>%
  mutate(across(c(home:product_page), ~ ifelse(.==1, duration, .)))

Output:

  ID URL_visit     URL_name duration home login product_page
1  1         1         home       14   14     0            0
2  1         2        login       40    0    40            0
3  1         3 product_page      233    0     0          233
4  2         1         home        8    8     0            0
5  3         1         home       76   76     0            0
6  3         2 product_page      561    0     0          561

Also, if the original variables need to be kept, this code can help:

df %>%
  mutate(across(c(home:product_page),.fns = list(duration = ~ ifelse(.==1, duration, .)) ))

Output:

  ID URL_visit     URL_name duration home login product_page home_duration login_duration
1  1         1         home       14    1     0            0            14              0
2  1         2        login       40    0     1            0             0             40
3  1         3 product_page      233    0     0            1             0              0
4  2         1         home        8    1     0            0             8              0
5  3         1         home       76    1     0            0            76              0
6  3         2 product_page      561    0     0            1             0              0
  product_page_duration
1                     0
2                     0
3                   233
4                     0
5                     0
6                   561

Here is Data.table solution:

library(data.table)
url_values <- unique(df$URL_name)
for (val in url_values){
  setDT(df)[,home := ifelse(home == 1, duration, home)]
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM