
How to extract specific words from a string with pattern in R

I have a data frame that contains the names of the supervisors and advisors of students' dissertations in a faculty, for example:

 DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

I want to separate supervisors and advisors into two distinct columns, like this:

DF1<-data.frame(Supervisor=c("Ali Ahmadi","Ali Ahmadi","Ali Ahmadi"),Advisors=c("Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"))

DF1
  Supervisor                                             Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

I tried the following code:

DF1<-strsplit(DF$Names, "Name :")

stopwords = c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")

DF2 <- lapply(DF1,function(x) unlist(strsplit(x," ")) )

DF3 <- lapply(DF2,function(x)  x[!x %in% stopwords] )

DF4<-lapply(DF3,function(x)  paste(x, collapse = " "))

But the final result, shown below, is not what I expected, and it apparently needs further work to be converted to a data frame:

DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[2]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

[[3]]
[1] " Ali , Ahmadi , First supervisor  Aram , Rezaeei ,  Omid , Saeedi ,  Nima , Shaki ,  Sohrab , Karimi ,"

Is there a simpler way to solve this? I found that regular expressions can be helpful, but I don't know how to use them, at least for my example. Thanks in advance for any answer.

Here's an attempt with extract from tidyr:

library(tidyr)
library(dplyr)  # provides mutate()
DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names,
          into = c("Supervisor", "Advisor"),
          regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
  Supervisor                                              Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Explanation (as requested by OP):

The regular expression passed to extract's regex argument is designed to do two tasks:

  • (i) it must describe the string as a whole, from beginning to end
  • (ii) it must pick out those elements that should populate the newly created columns

Task (i) is achieved in that (\\w+\\s\\w+) captures the two words that make up the Supervisor name, while \\s describes (but does not capture) the following whitespace, and (.*) matches anything that follows that whitespace, i.e., in this case, the four Advisor names.

Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups, written as parentheses; these parentheses are the syntax by which extract knows that their content should go into the new columns.
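To see the two capturing groups in isolation, here is a minimal base R sketch using regexec and regmatches instead of extract. The input string cleaned is a shortened, made-up example of what the first gsub step produces, not the full data:

```r
# Assumed input: a shortened string of the kind left after the cleaning step
cleaned <- "Ali Ahmadi Aram Rezaeei Omid Saeedi"

# regexec() reports the full match followed by one entry per capturing group
m <- regmatches(cleaned,
                regexec("(\\w+\\s\\w+)\\s(.*)", cleaned, perl = TRUE))[[1]]

supervisor <- m[2]  # group 1: the first two words
advisors   <- m[3]  # group 2: everything after the following whitespace

supervisor
advisors
```

Here m[1] holds the whole match, so the groups start at index 2; extract does the same bookkeeping and assigns each group to the column named in into.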

Finally, the commas are inserted between the Advisor names, again using a capturing group, which can be recalled in gsub's replacement argument with the backreference \\1. The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
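The effect of the lookahead is easiest to see by running the same substitution with and without it. This is a small sketch on a single string; the variable names are only for illustration:

```r
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# With the negative lookahead: no comma is added after the last name pair
with_lookahead <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

# Without it: every name pair gets a comma, including the final one
without_lookahead <- gsub("(\\w+\\s\\w+\\b)", "\\1,", advisors, perl = TRUE)

with_lookahead
without_lookahead
```

Lookaheads are a PCRE feature, which is why perl = TRUE is needed in the gsub call.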

Data:

DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

Here is a base R solution.

DF <- data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
                       "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

stopwords <- c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
stoppattern <- paste(stopwords, collapse = "|")

DF1 <- strsplit(DF$Names, "Name :")
DF1 <- lapply(DF1, \(x) trimws(x[sapply(x, nchar) > 0L]))

DF2 <- lapply(DF1, \(x) {
  gsub(stoppattern, "", x)
})

DF3 <- lapply(DF2, \(x) {
  # split each record on commas, then trim and drop the empty pieces
  y <- strsplit(x, ",")
  lapply(y, \(.y) {
    .y <- trimws(.y)
    .y[nchar(.y) > 0L]
  })
})

DF4 <- lapply(DF3, \(x) {
  # the first record is the supervisor: keep name and family, drop the type
  Supervisor <- paste(x[[1]][1:2], collapse = " ")
  # the remaining records are advisors: join name and family within each record
  Advisors <- vapply(x[-1], paste, character(1), collapse = " ")
  # then separate the advisors from one another with commas
  Advisors <- paste(Advisors, collapse = ", ")
  data.frame(Supervisor, Advisors)
})

Final <- do.call(rbind, DF4)
Final
#>   Supervisor                                             Advisors
#> 1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Created on 2022-06-05 by the reprex package (v2.0.1)

Messy Base R:

# Store a vector of names: ir_names => character vector
ir_names <- c("Name", "Family", "Type")

# Compute its length: ir_name_len => integer scalar
ir_name_len <- length(ir_names)

# Compute the desired result: res => data.frame
res <- do.call(
  rbind, 
  lapply(
    strsplit(
      DF$Names,
      "Name\\s+\\:\\s+"
    ),
    function(x){
      y <- data.frame(tmp = unlist(strsplit(x, " , ")))
      ir1 <- setNames(
        data.frame(
          do.call(
            rbind, 
            lapply(
              split(
                y, 
                ceiling(seq_len(nrow(y))/ir_name_len)
              ), 
              t
            )
          ),
          row.names = NULL,
          stringsAsFactors = FALSE
        ),
        ir_names
      )
      ir2 <- transform(
        ir1,
        Name = trimws(paste(Name, gsub("Family\\s+\\:\\s+", "", Family))),
        Type = trimws(gsub("Type\\s+\\:\\s+", "", Type))
      )[,c("Name", "Type")]
      ir3 <- data.frame(
        Supervisor = ir2$Name[which(grepl("supervisor", ir2$Type))],
        Advisor = toString(ir2$Name[-which(grepl("supervisor", ir2$Type))]),
        stringsAsFactors = FALSE,
        row.names = NULL
      )
    }
  )
)
# Print to console: data.frame => stdout(console)
res
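The core of the code above is the step that chops the flat vector of pieces into records of three (Name, Family, Type). A minimal sketch of just that step, on a made-up two-record vector:

```r
# Assumed input: the pieces of two records after splitting on " , "
pieces <- c("Ali", "Family : Ahmadi", "Type : First supervisor",
            "Aram", "Family : Rezaeei", "Type : Advisor")

# Positions 1-3 get group 1, positions 4-6 get group 2, and so on
group_id <- ceiling(seq_along(pieces) / 3)

# split() returns one element per group, named by the group id
records <- split(pieces, group_id)
records
```

The full solution does the same thing on a one-column data frame and then transposes each group so that every record becomes one row.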
