How to extract specific words from a string with pattern in R
I have a data frame that contains the names of the supervisors and advisors of students' theses in a faculty, for example:
DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
I would like to separate the supervisor and the advisors into two different columns, like this:
DF1<-data.frame(Supervisor=c("Ali Ahmadi","Ali Ahmadi","Ali Ahmadi"),Advisors=c("Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi","Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"))
DF1
Supervisor Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
I tried the following code:
DF1<-strsplit(DF$Names, "Name :")
stopwords = c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
DF2 <- lapply(DF1,function(x) unlist(strsplit(x," ")) )
DF3 <- lapply(DF2,function(x) x[!x %in% stopwords] )
DF4<-lapply(DF3,function(x) paste(x, collapse = " "))
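But the final result, shown below, is not what I expected, and it clearly needs further work before it can be converted to a data frame: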
DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
[[2]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
[[3]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"
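Is there a simpler way to solve this? I found that regexp may be helpful, but I don't know how to use it, at least in my example. Thanks in advance for any answers.
Here is an attempt using extract: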
library(tidyr)
library(dplyr)  # mutate() comes from dplyr
DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names,
          into = c("Supervisor", "Advisor"),
          regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
Supervisor Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
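Explanation (as requested by the OP):
The regular expression in the regex argument of extract is designed to accomplish two tasks:
Task (i) is achieved by capturing the two words that make up the Supervisor name with (\\w+\\s\\w+), while \\s describes (but does not capture) the whitespace that follows, and (.*) describes/matches whatever comes after that whitespace, i.e., in this case, the four Advisor names.
Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capture groups, which are written as parentheses; these parentheses are the "syntax" by which the function extract "knows" that their contents should go into the new columns.
Finally, a capture group is used again to insert commas between the Advisor names: the captured name is recalled with the backreference \\1 in the replacement argument of gsub, and a comma is appended to it. The expression (?!$) is a negative lookahead, which asserts that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (denoted by $). Hope this helps!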
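To see the capture groups and the negative lookahead in isolation, here is a minimal base R sketch on a single string that is assumed to be already cleaned by the first gsub() step. The variable names cleaned, parts, supervisor, advisors and with_commas are made up for this illustration, and regexec()/regmatches() are used here in place of extract(); the two patterns are the ones from the answer above.

```r
# A single cleaned string (assumed input, i.e. after the stop words are removed)
cleaned <- "Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# Two capture groups: the first two words, then everything after the next space
parts <- regmatches(cleaned, regexec("(\\w+\\s\\w+)\\s(.*)", cleaned))[[1]]
supervisor <- parts[2]  # contents of the first capture group
advisors   <- parts[3]  # contents of the second capture group

# Append a comma to every two-word name, unless the string ends right after it
with_commas <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

print(supervisor)   # "Ali Ahmadi"
print(with_commas)  # "Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"
```

Note that perl = TRUE is needed for the lookahead, since R's default regex engine does not support (?!...).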
Data:
DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
Here is a base R solution.
DF <- data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
"Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
stopwords <- c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
stoppattern <- paste(stopwords, collapse = "|")
DF1 <- strsplit(DF$Names, "Name :")
# drop the empty piece before the first "Name :" and trim whitespace
DF1 <- lapply(DF1, \(x) trimws(x[nchar(x) > 0L]))
# remove the stop words from every piece
DF2 <- lapply(DF1, \(x) gsub(stoppattern, "", x))
# split each piece at the commas and drop empty fragments
DF3 <- lapply(DF2, \(x) {
  y <- strsplit(x, ",")
  lapply(y, \(.y) {
    .y <- trimws(.y)
    .y[nchar(.y) > 0L]
  })
})
DF4 <- lapply(DF3, \(x) {
  # the first entry is the supervisor: first name and family name
  Supervisor <- paste(trimws(x[[1]][1:2]), collapse = " ")
  # every remaining entry is one advisor: join first name and family name
  # with a space, then separate the advisors with commas
  Advisors <- vapply(x[-1], \(.x) paste(trimws(.x), collapse = " "), character(1))
  Advisors <- paste(Advisors, collapse = ", ")
  data.frame(Supervisor, Advisors)
})
Final <- do.call(rbind, DF4)
Final
#> Supervisor Advisors
#> 1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
#> 3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
Created on 2022-06-05 by the reprex package (v2.0.1)
Messy base R:
# Store a vector of names: ir_names => character vector
ir_names <- c("Name", "Family", "Type")
# Compute its length: ir_name_len => integer scalar
ir_name_len <- length(ir_names)
# Compute the desired result: res => data.frame
res <- do.call(
  rbind,
  lapply(
    strsplit(
      DF$Names,
      "Name\\s+\\:\\s+"
    ),
    function(x){
      y <- data.frame(tmp = unlist(strsplit(x, " , ")))
      ir1 <- setNames(
        data.frame(
          do.call(
            rbind,
            lapply(
              split(
                y,
                ceiling(seq_len(nrow(y))/ir_name_len)
              ),
              t
            )
          ),
          row.names = NULL,
          stringsAsFactors = FALSE
        ),
        ir_names
      )
      ir2 <- transform(
        ir1,
        Name = trimws(paste(Name, gsub("Family\\s+\\:\\s+", "", Family))),
        Type = trimws(gsub("Type\\s+\\:\\s+", "", Type))
      )[,c("Name", "Type")]
      ir3 <- data.frame(
        Supervisor = ir2$Name[which(grepl("supervisor", ir2$Type))],
        Advisor = toString(ir2$Name[-which(grepl("supervisor", ir2$Type))]),
        stringsAsFactors = FALSE,
        row.names = NULL
      )
    }
  )
)
# Print to console: data.frame => stdout(console)
res
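The idea used above, selecting the supervisor by the Type field rather than by position, can also be sketched more compactly. This is only an illustration under the assumption that every person follows the "Name : X , Family : Y , Type : Z" layout of the question's data; split_roles is a hypothetical helper name, and record is one entry copied from DF$Names.

```r
# One record in the same format as DF$Names (copied from the question's data)
record <- "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"

# Hypothetical helper: turn one record into a one-row data frame
split_roles <- function(s) {
  # One piece per person; drop the empty piece before the first "Name : "
  pieces <- strsplit(s, "Name : ", fixed = TRUE)[[1]]
  pieces <- pieces[nzchar(pieces)]

  first  <- sub(" ,.*", "", pieces)                    # text before the first " ,"
  family <- sub(".*Family : (\\w+).*", "\\1", pieces)  # word after "Family : "
  type   <- sub(".*Type : ", "", pieces)               # text after "Type : "

  full   <- paste(first, family)
  is_sup <- grepl("supervisor", type, fixed = TRUE)

  data.frame(
    Supervisor = toString(full[is_sup]),
    Advisors   = toString(full[!is_sup]),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to each record and stack the one-row results
roles <- do.call(rbind, lapply(c(record, record), split_roles))
print(roles)
```

With the real data, c(record, record) would be replaced by DF$Names. Using a logical index (is_sup and !is_sup) instead of -which(...) also keeps the advisor list intact when a record happens to contain no supervisor.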