
Splitting a string by 2 different conditions

I have a list of names (famous directors) in the format First, (possible middle), Last Name, which I need to rearrange into Last Name, First (possible middle). I can't just split all of these at the first space, or even the second space, since some last names actually consist of two words.

Here is a shortened example of the list I'm working with, which shows some of the tricky situations:

> directors.names
  [1] "Frank Darabont,"                   "Francis Ford Coppola,"                        
  [3] "Christopher Nolan,"                "Carl Theodor Dreyer,"                                    
  [5] "Peter Jackson,"                    "Quentin Tarantino,"                                  
  [7] "John G. Avildsen,"                 "David Fincher,"                                   
  [9] "Christopher Nolan,"                "Peter Jackson,"                                    
 [11] "Lana Wachowski,"                   "Martin Scorsese,"                                    
 [13] "Akira Kurosawa,"                   "Bong Joon Ho,"                                      
 [15] "Fernando Meirelles,"               "Florian Henckel von Donnersmarck,"

In this example, I would need to split "John G. Avildsen" after the G., but "Bong Joon Ho" after the first space, and "Florian Henckel von Donnersmarck" after the 2nd space (just to point out a few).

I've added a comma to the end of all strings so that I can then transpose the strings and have it return Last Name, First (possible middle) format.

I went through my list and found all the cases where a word needs to stay with the last-name portion, and tried to split on those first. But it isn't splitting where I need it to; it's just putting each string into its own index.

Here is what I have right now:

directors.names <- paste0(directors.1, ",")
directors.names <- strsplit(directors.names, "[[:space:]]+('von'|'Ford'|'Joon'|'De'|'del'|'Van')[[:space:]]+", perl = TRUE)  

Once I can get this to work correctly, I'll then need to remove any duplicate names.

We know the patterns to extract (first word, last word, and occasionally a two-word last name), so we may fare better with an extract approach rather than a split approach: we do not know the number of words in every name, so it would be difficult to split on the nth whitespace.
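To illustrate why splitting is awkward here, the following base R sketch applies a shortened version of the question's pattern to a single name. The variable names (name, quoted, unquoted) are only for illustration. It suggests two separate problems: quote characters inside the regex are matched literally, and strsplit() discards whatever the pattern matches.

```r
name <- "Florian Henckel von Donnersmarck"

# With literal quote characters inside the regex (as in the question),
# nothing matches, so the string comes back unsplit.
quoted <- strsplit(name, "[[:space:]]+('von'|'Ford')[[:space:]]+", perl = TRUE)[[1]]

# Without the quotes the split happens, but the matched "von" is consumed
# along with the surrounding whitespace.
unquoted <- strsplit(name, "[[:space:]]+(von|Ford)[[:space:]]+", perl = TRUE)[[1]]

print(quoted)
print(unquoted)
```

Even with the quotes removed, the "von" would have to be put back afterwards, which is part of why extracting the pieces directly tends to be simpler.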

We can define a pattern for common two-word last names, then insert this pattern with glue::glue inside str_extract_all.

In the following call to str_extract_all, we define 3 possible patterns to extract:

  • a first word ^\\w+
  • a two-word last name (({two_word_patterns})\\s+\\w+$)
  • a regular last name \\w+$

These three should be collapsed with | as the separator, all within a single regex string (no quotation marks in between).
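As a rough sketch of how the three alternatives combine, the pattern can also be assembled and tried out in base R. Here paste() stands in for glue::glue, and regmatches() with gregexpr() stands in for str_extract_all; this is a substitution made only to keep the example free of package dependencies, and the names pattern and parts are illustrative.

```r
# First words of known two-word last names (taken from the answer).
two_word_patterns <- "(von)|(Ford)|(Joon)|(De)|(del)|(Van)"

# Join the three alternatives with "|" into one regex string.
pattern <- paste(
  "^\\w+",                                         # a first word
  paste0("((", two_word_patterns, ")\\s+\\w+$)"),  # a two-word last name
  "\\w+$",                                         # a regular last name
  sep = "|"
)

directors <- c("Fernando Meireles", "Bong Joon Ho", "Florian Henckel von Donnersmarck")

# gregexpr() finds every match per string; regmatches() pulls them out as a list.
parts <- regmatches(directors, gregexpr(pattern, directors, perl = TRUE))
print(parts)
```

The two-word alternative is listed before the plain last-name alternative so that "Joon Ho" is tried as a unit before "Ho" alone.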

After extracting the names, we can reverse the order with rev(), and, finally, paste them back together with toString.

toString is specifically useful when we need to paste character elements with a ", " separator, like in this case.
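A minimal illustration of that last step on a single, hand-written pair of extracted pieces (the vector parts is made up for the example):

```r
# Pieces as they would come out of the extraction step: first name, then last name.
parts <- c("Bong", "Joon Ho")

# rev() puts the last name first; toString() joins the elements with ", ".
result <- toString(rev(parts))
print(result)
```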

library(glue)
library(stringr)
library(purrr)

# First words of known two-word last names
two_word_patterns <- '(von)|(Ford)|(Joon)|(De)|(del)|(Van)'

directors <- c("Fernando Meireles", "Bong Joon Ho", "Florian Henckel von Donnersmarck")

# Extract the first word and the last name, reverse them, and join with ", "
directors %>%
    str_extract_all(pattern = glue('^\\w+|(({two_word_patterns})\\s+\\w+$)|\\w+$')) %>%
    map(rev) %>%
    map(toString)

[[1]]
[1] "Meireles, Fernando"

[[2]]
[1] "Joon Ho, Bong"

[[3]]
[1] "von Donnersmarck, Florian"
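Note that the pattern above keeps only the first word and the last name, so as far as I can tell a middle name such as "G." or "Henckel" is not carried over. If middle names should be kept, and duplicates removed as the question mentions, one possible alternative is a single sub() call with two capture groups. This is a different technique from the one in the answer, sketched in base R under the assumption that the input names have no trailing comma; directors_raw, prefixes and reordered are hypothetical names.

```r
# Hypothetical input: names without the trailing comma, including a duplicate.
directors_raw <- c("Frank Darabont", "Francis Ford Coppola", "Christopher Nolan",
                   "John G. Avildsen", "Christopher Nolan", "Bong Joon Ho",
                   "Florian Henckel von Donnersmarck")

# Words that should stay attached to the last name (list taken from the question).
prefixes <- "von|Ford|Joon|De|del|Van"

# Group 1: everything before the last name (lazy, so a prefix is claimed by group 2).
# Group 2: an optional prefix word plus the final word.
pattern <- paste0("^(.*?)\\s+((?:(?:", prefixes, ")\\s+)?\\S+)$")

# Swap the two groups, then drop duplicate names.
reordered <- unique(sub(pattern, "\\2, \\1", directors_raw, perl = TRUE))
print(reordered)
```

A name that consists of a single word does not match the pattern and is returned unchanged.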

 