简体   繁体   中英

R: Replace string when partial match to another column by row

I would like to replace/remove those parts of a string ( name ) that match to other columns ( state and city ) in my data table.

I managed to identify the rows, eg with city, like so: dt%>% filter(str_detect(name, city)) but I am missing a way to use gsub (or grep ) with the rowwise value of the column city.

I know that a rather manual approach like storing all city names in a vector and enter them in gsub would work but it would also falsely remove the "dallas" of row 2. (This was manageable for states though and could be combined with gsub to also remove "of".)


Data and desired output

dt<- data.table(city = c("arecibo","arecibo","cabo rojo", "new york", "dallas"), 
state=c("pr", "pr", "pr", "ny", "tx"), 
name=c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"), 
desired=c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))

Here's a solution, but it can probably be achieved faster with gsub methods. Anyway:

library(tidyverse)
dt %>% 
  mutate(test = str_remove_all(name,city)) %>% 
  mutate(test = str_remove_all(test,state)) %>% 
  mutate(test = str_remove_all(test," of ")) %>% 
  mutate(test = str_remove_all(test,"^ ")) %>% 
  mutate(test = str_remove_all(test," *$"))

Output:

        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

A data.table solution:

# Helper function
subxy <-  function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)

dt[,  desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]

#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

library(dplyr)

dt %>% rowwise() %>%
    mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|'))%>%
               trimws())

# A tibble: 5 × 5
# Rowwise: 
  city      state name                   desired       desired_2    
  <chr>     <chr> <chr>                  <chr>         <chr>        
1 arecibo   pr    frutas of pr arecibo   frutas        frutas       
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant  
4 new york  ny    greens new york        greens        greens       
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM