
Removing everything from a string column conditional on another string column in R

I am trying to clean up a column containing long speeches from a debate. Every row starts with a new speaker, but subheaders, and the text that follows them, remain at the end of some speeches, which is not desirable.

Here is some example data:

library(dplyr)    # provides tibble(), mutate() and %>%
library(stringr)  # provides str_remove()

speeches <- tibble(subheader = c("3.Discussion", "8.Voting"),
                   full_speech = c("I close this part. 3.Discussion Let's start with",
                                   "I think we can vote now")
                   )

Desired Outcome:

subheader     full_speech
3.Discussion  I close this part.
8.Voting      I think we can vote now

What I tried so far:

speeches %>%
    mutate(full_speech = str_remove(full_speech, subheader))

But of course this only deletes the subheaders themselves, not the text that follows them.

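We can paste the subheader together with .* so that the pattern matches the subheader and every character that follows it. The leading \\s+ also removes the whitespace in front of the subheader.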

library(dplyr)
library(stringr)
speeches %>% 
  mutate(full_speech = str_remove(full_speech, str_c("\\s+", 
      subheader, ".*")))

Output:

# A tibble: 2 × 2
  subheader    full_speech            
  <chr>        <chr>                  
1 3.Discussion I close this part.     
2 8.Voting     I think we can vote now
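
One caveat with the pattern above: subheader is interpreted as a regular expression, so the dot in 3.Discussion matches any character, and a subheader containing characters such as ( or ? could break the pattern. A minimal base R sketch that matches the subheader literally instead, assuming the same two columns; cut_at_marker is a hypothetical helper name, not part of any package:

```r
# Same example data as in the question, as a plain data frame.
speeches <- data.frame(
  subheader = c("3.Discussion", "8.Voting"),
  full_speech = c("I close this part. 3.Discussion Let's start with",
                  "I think we can vote now"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the text before the first literal
# occurrence of `marker`; return `text` unchanged if it is not found.
cut_at_marker <- function(text, marker) {
  pos <- as.integer(regexpr(marker, text, fixed = TRUE))
  if (pos == -1L) {
    return(text)
  }
  trimws(substr(text, 1, pos - 1))
}

# Pair each speech with the subheader from the same row.
speeches$full_speech <- mapply(cut_at_marker,
                               speeches$full_speech,
                               speeches$subheader,
                               USE.NAMES = FALSE)

print(speeches)
```

Because fixed = TRUE switches off regular expression syntax, a speech containing "3xDiscussion" would be left alone here, whereas the regex version would cut it.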

An approach using gsub and paste0 to build the pattern from subheader, one row at a time:

library(dplyr)

speeches %>% 
  rowwise() %>%
  mutate(full_speech = gsub(
           paste0(" ", subheader, ".*", collapse=""), "", full_speech)) %>% 
  ungroup()

Output:

# A tibble: 2 × 2
  subheader    full_speech            
  <chr>        <chr>                  
1 3.Discussion I close this part.     
2 8.Voting     I think we can vote now
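
rowwise() is used here because gsub() accepts only a single pattern; given a vector of patterns it uses the first element (with a warning). A rough base R equivalent without dplyr, as a sketch, pairs each pattern with its own row via mapply(). It uses \\s* rather than a literal space in front of the subheader, which is an assumption about how the whitespace should be handled:

```r
# Same example data as in the question, as a plain data frame.
speeches <- data.frame(
  subheader = c("3.Discussion", "8.Voting"),
  full_speech = c("I close this part. 3.Discussion Let's start with",
                  "I think we can vote now"),
  stringsAsFactors = FALSE
)

# One pattern per row: optional whitespace, the subheader, then everything after it.
patterns <- paste0("\\s*", speeches$subheader, ".*")

# mapply() walks over patterns and speeches in parallel, so each speech
# is only ever matched against the subheader from its own row.
speeches$full_speech <- mapply(
  function(pattern, text) sub(pattern, "", text),
  patterns,
  speeches$full_speech,
  USE.NAMES = FALSE
)

print(speeches)
```

sub() is enough here since .* already consumes the rest of the string, so there can be at most one match per speech.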
