
Extract names from a string using a list of names with grepl and a loop and add them to a new column in R

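I have a dataset in which one column contains names and another describes what that person did during the day. I am trying to work out, in R, who met whom on that day. I created a vector containing the names in the dataset and used grepl in a loop to find where those names appear in the column describing people's activities.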

name <- c("Dupont","Dupuy","Smith") 

activity <- c("On that day, he had lunch with Dupuy in London.", 
              "She had lunch with Dupont and then went to Brighton to meet Smith.", 
              "Smith remembers that he was tired on that day.")

met_with <- c("Dupont","Dupuy","Smith")

df<-data.frame(name, activity, met_with=NA)


for (i in 1:length(met_with)) {
df$met_with<-ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}

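However, this solution is unsatisfactory for two reasons. When a person met more than one person (Dupuy in my example), I cannot extract more than one name. I also cannot tell R not to return the person's own name when it is used instead of a pronoun in the activity column (for example, Smith).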

Ideally, I would like df to look like:

  name         activity                                            met_with                             
  Dupont       On that day, he had lunch with Dupuy in London.     Dupuy
  Dupuy        She had lunch with Dupont and then (...).           Dupont Smith
  Smith        Smith remembers that he was tired on that day.      NA

I am cleaning the strings to build an edge list and a node list for a network analysis later on.

Thanks

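You can use setdiff to exclude the row's own name from the names to match, and use gregexpr and regmatches to extract the names that do match. Maybe also consider surrounding the names with \\b (word boundaries).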

for(i in seq_len(nrow(df))) {
  df$met_with[i] <- paste(regmatches(df$activity[i],
   gregexpr(paste(setdiff(name, df$name[i]), collapse="|"),
   df$activity[i]))[[1]], collapse = " ")
}

df
#    name                                                           activity     met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.        Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont Smith
#3  Smith                     Smith remembers that he was tired on that day.             
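
As a sketch of the word-boundary suggestion: the pattern below wraps the alternation in \\b so that a name only matches as a whole word (a hypothetical longer name such as Smithson would then not count as Smith). The variable names others, pattern and hits, the use of perl = TRUE, and returning NA for rows without a match are assumptions made for illustration, not part of the answer above.

```r
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, met_with = NA_character_)

for (i in seq_len(nrow(df))) {
  # Every known name except the one belonging to this row
  others <- setdiff(name, df$name[i])
  # Alternation of the remaining names, anchored on word boundaries
  pattern <- paste0("\\b(", paste(others, collapse = "|"), ")\\b")
  hits <- regmatches(df$activity[i],
                     gregexpr(pattern, df$activity[i], perl = TRUE))[[1]]
  # Assumption: rows without any match get NA rather than an empty string
  df$met_with[i] <- if (length(hits) > 0) {
    paste(unique(hits), collapse = " ")
  } else {
    NA_character_
  }
}

df$met_with
```

Note that names containing regex metacharacters would need escaping before being pasted into the pattern; the three names used here do not.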

Another way, using Reduce, could be:

df$met_with <- Reduce(function(x, y) {
  i <- grepl(y, df$activity, fixed = TRUE) & y != df$name
  x[i] <- lapply(x[i], `c`, y)
  x
}, unique(name), vector("list", nrow(df)))

df
#    name                                                           activity      met_with
#1 Dupont                    On that day, he had lunch with Dupuy in London.         Dupuy
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith. Dupont, Smith
#3  Smith                     Smith remembers that he was tired on that day.          NULL
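
The Reduce approach yields a list column, with NULL for rows that have no match. If a plain character column is preferred, as in the desired output of the question, one possible way to flatten it is sketched below. The list met is typed in by hand here to mirror the result shown above, and the name flat is hypothetical.

```r
# Hand-written stand-in for the list column produced by the Reduce call
met <- list("Dupuy", c("Dupont", "Smith"), NULL)

# Collapse each element into one string; NULL becomes NA
flat <- vapply(met, function(x) {
  if (is.null(x)) NA_character_ else paste(x, collapse = " ")
}, character(1))

flat
```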

Same logic as @Gki, but using stringr functions and mapply instead of a loop.

library(stringr)

pat <- str_c('\\b', df$name, '\\b', collapse = '|')
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '), 
       str_extract_all(df$activity, pat), df$name)

df

#    name                                                           activity
#1 Dupont                    On that day, he had lunch with Dupuy in London.
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3  Smith                     Smith remembers that he was tired on that day.

#      met_with
#1        Dupuy
#2 Dupont Smith
#3             
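
Since the question mentions building an edge list afterwards, here is a rough sketch of how a space-separated met_with column could be turned into one. It assumes that names contain no spaces and that rows without a match hold an empty string, as in the output above. The column names from and to, and the variables targets and edges, are made up for this example.

```r
df <- data.frame(
  name = c("Dupont", "Dupuy", "Smith"),
  met_with = c("Dupuy", "Dupont Smith", ""),
  stringsAsFactors = FALSE
)

# Split each met_with string into individual names;
# an empty string yields character(0), so that row adds no edge
targets <- strsplit(df$met_with, " ", fixed = TRUE)

# Repeat each source name once per person met
edges <- data.frame(
  from = rep(df$name, lengths(targets)),
  to = unlist(targets),
  stringsAsFactors = FALSE
)

edges
```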

