Finding strings that appear in one column but are missing from another, based on a list of strings
I have a data frame, data:
data = data.frame("text" = c("John met Jay who met Jack who met Josh who met Jamie", "John and Jay and Jack and Josh and Jamie"),
"names.in.text" = c("Jay; Jack; Josh; Jamie", "John; Jack; Josh; Jamie"),
"missing.names" = c("",""))
> data
text names.in.text missing.names
1 John met Jay who met Jack who met Josh who met Jamie Jay; Jack; Josh; Jamie
2 John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie
and a second data frame, names:
names = data.frame("names" = c("John", "Jay", "Jack", "Josh", "Jamie"))
> names
names
1 John
2 Jay
3 Jack
4 Josh
5 Jamie
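I am trying to find out whether data$names.in.text contains all the names that appear in data$text. The universe of names is in names$names. Ideally, for each row, data$missing.names should tell me which names$names occur in the text but are missing from data$names.in.text: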
text names.in.text missing.names
1 John met Jay who met Jack who met Josh who met Jamie Jay; Jack; Josh; Jamie John
2 John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie Jay
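Or any other layout that easily tells me which names are in the text but missing from names.in.text.
So essentially I am looking for which names$names are contained in data$text but not in data$names.in.text, and then want to list those names in data$missing.names.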
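To make the goal concrete, here is a minimal base R sketch of the set logic for a single row. The variable names (`universe`, `text`, `listed`, `missing_names`) are illustrative and not part of the question, and the sketch assumes the names contain no regex metacharacters.

```r
# Illustrative single-row example; all variable names here are hypothetical.
universe <- c("John", "Jay", "Jack", "Josh", "Jamie")
text     <- "John met Jay who met Jack who met Josh who met Jamie"
listed   <- "Jay; Jack; Josh; Jamie"

# Names from the universe that occur in the text as whole words
hits    <- vapply(universe,
                  function(nm) grepl(paste0("\\b", nm, "\\b"), text),
                  logical(1))
in_text <- universe[hits]

# Names recorded in the semicolon-separated column
in_listed <- trimws(strsplit(listed, ";", fixed = TRUE)[[1]])

# Names present in the text but absent from the recorded list
missing_names <- setdiff(in_text, in_listed)
missing_names
```

For this row the only name in the text that is not recorded is "John", which matches the desired output shown above.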
A tidyverse solution:
library(tidyverse)
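data %>%
  # for each row, pair the text (.x) with the vector of names already recorded (.y)
  mutate(missing.names = map2_chr(text, str_split(names.in.text, '; '),
    # build an alternation pattern from the names not recorded, extract the ones found in the text, and join them
    ~ str_c(str_extract_all(.x, regex(str_c(setdiff(names$names, .y), collapse = '|')))[[1]], collapse = '; ')))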
# # A tibble: 2 × 3
# text names.in.text missing.names
# <chr> <chr> <chr>
# 1 John met Jay who met Jack who met Josh who met Jamie Jay; Jack; Josh; Jamie John
# 2 John and Jay and Jack and Josh and Jamie John; Jack; Josh; Jamie Jay
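The one-liner above is dense, so here is a sketch that unrolls the same steps for the second row. It assumes the stringr package is installed; the intermediate names (`listed_vec`, `candidates`, `pattern`, `found`, `result`) are illustrative. The guard for an empty candidate set and the `unique()` call are additions of this sketch, not part of the answer: an empty pattern does not mean "no match", and a name repeated in the text would otherwise be listed more than once.

```r
library(stringr)

# Illustrative values taken from row 2 of the example data
universe <- c("John", "Jay", "Jack", "Josh", "Jamie")
text     <- "John and Jay and Jack and Josh and Jamie"
listed   <- "John; Jack; Josh; Jamie"

# 1. Split the recorded names into a character vector
listed_vec <- str_split(listed, "; ")[[1]]

# 2. Candidates: names in the universe that were not recorded
candidates <- setdiff(universe, listed_vec)

# 3. Extract the candidates that actually occur in the text
if (length(candidates) == 0) {
  # nothing can be missing; skip the regex entirely (guard added in this sketch)
  result <- ""
} else {
  pattern <- str_c(candidates, collapse = "|")
  found   <- unique(str_extract_all(text, pattern)[[1]])
  result  <- str_c(found, collapse = "; ")
}
result
```

Here `candidates` is just "Jay", so the pattern is the literal string "Jay" and `result` is "Jay". Wrapping each candidate in `\\b` would additionally restrict matches to whole words.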
A base R approach using apply / sapply. I extended the first text and the names with "JJ" to show multiple missing names.
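# Logical matrix (rows of data x names): TRUE where a name occurs in the text
# as a whole word but is not listed in names.in.text
data$missing.names <- apply(sapply(names$names, function(nms)
  grepl(paste0("\\b", nms, "\\b"), data$text) &
    !grepl(paste0("\\b", nms, "\\b"), data$names.in.text)), 1, function(x)
  # collapse the names flagged for each row into one string
  paste(names$names[x], collapse = ", "))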
data
text
1 John met Jay who met Jack who met Josh who met Jamie JJ
2 John and Jay and Jack and Josh and Jamie
names.in.text missing.names
1 Jay; Jack; Josh; Jamie John, JJ
2 John; Jack; Josh; Jamie Jay
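The apply / sapply combination relies on sapply returning a matrix, which I believe is not the case when data has a single row. As a hedged alternative, the sketch below wraps the same logic in a row-wise helper. `find_missing` is a hypothetical function, not part of the answer, and it assumes the names contain no regex metacharacters and that entries in names.in.text are separated by semicolons.

```r
# Hypothetical helper (not from the original answer): for each row, return the
# names that occur in `text` as whole words but are absent from `listed`.
find_missing <- function(text, listed, universe, sep = "; ") {
  text   <- as.character(text)
  listed <- as.character(listed)
  vapply(seq_along(text), function(i) {
    # names recorded for this row
    recorded <- trimws(strsplit(listed[i], ";", fixed = TRUE)[[1]])
    # names from the universe that occur in this row's text
    hits <- vapply(universe,
                   function(nm) grepl(paste0("\\b", nm, "\\b"), text[i]),
                   logical(1))
    paste(setdiff(universe[hits], recorded), collapse = sep)
  }, character(1))
}

# Same extended example as in the answer (with "JJ" added)
data <- data.frame(
  text = c("John met Jay who met Jack who met Josh who met Jamie JJ",
           "John and Jay and Jack and Josh and Jamie"),
  names.in.text = c("Jay; Jack; Josh; Jamie", "John; Jack; Josh; Jamie")
)
names <- data.frame(names = c("John", "Jay", "Jack", "Josh", "Jamie", "JJ"))

data$missing.names <- find_missing(data$text, data$names.in.text, names$names)
data$missing.names
```

Splitting names.in.text into exact entries and comparing with setdiff avoids partial matches on the recorded side, and vapply with character(1) keeps the result a plain character vector for any number of rows.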
data <- structure(list(text = c("John met Jay who met Jack who met Josh who met Jamie JJ",
"John and Jay and Jack and Josh and Jamie"), names.in.text = c("Jay; Jack; Josh; Jamie",
"John; Jack; Josh; Jamie")), class = "data.frame", row.names = c(NA,
-2L))
names <- structure(list(names = c("John", "Jay", "Jack", "Josh", "Jamie",
"JJ")), class = "data.frame", row.names = c(NA, -6L))