How can I join data using a fuzzy match in R?

I have some subject and license data and would like to create a column that flags whether the license is appropriate for the subject listed. The additional challenge is that some teachers teach multiple subjects, separated by semicolons, and each license has several acceptable subjects.

I think I need to incorporate something like grep, but I'm not quite sure how to add this function while also joining the data from two tables.

Sample code

Below is an excerpt of my dataframe:

df1 <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students", 
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts", 
"Spanish Language Arts; I teach all subjects for my students",
 "Math", "Science;Social Studies;Mathematics;English Language Arts", "ELA", 
"English Language Arts"), 
Licensure = c("Content Area - Early Childhood (preK-Grade 3)", 
"Core Subjects (Grades EC-6) 1770", "Mathematics (Grades 7-12) 1706", 
"English Language Arts and Reading (Grades 7-12) 1709", "Core Subjects (Grades EC-6) 1770", 
"English Language Arts and Reading (Grades 7-12) 1709", 
"English Language Arts and Reading (Grades 7-12) 1709", 
"Content Area - Elementary Education (Grades 1-6)", 
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"))

Here is the list I created, which names each license and gives the acceptable subjects under it:

lic.subject_index <- list(
  "Content Area - Early Childhood (preK-Grade 3)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "Content Area - Elementary Education (Grades 1-6)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "Core Subjects (Grades EC-6) 1770" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "English Language Arts and Reading (Grades 7-12) 1709" = c("ELA", "English Language Arts", "Language Arts"),
  "Mathematics (Grades 7-12) 1706" = c("Math", "Mathematics")
)

What I would like to be able to do is create a column that flags whether the subject/license combination is acceptable or not:

ideal.df <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students", 
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts", 
"Spanish Language Arts; I teach all subjects for my students", "Math", 
"Science;Social Studies;Mathematics;English Language Arts", "ELA", "English Language Arts"), 
Licensure = c("Content Area - Early Childhood (preK-Grade 3)", "Core Subjects (Grades EC-6) 1770", 
"Mathematics (Grades 7-12) 1706", "English Language Arts and Reading (Grades 7-12) 1709", 
"Core Subjects (Grades EC-6) 1770", "English Language Arts and Reading (Grades 7-12) 1709", 
"English Language Arts and Reading (Grades 7-12) 1709", "Content Area - Elementary Education (Grades 1-6)", 
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"), 
flag = c("True", "True", "True", "True", "True", "False", "False", "True", "False", "True"))

Thank you in advance for any help you can provide!

Here is an option with tidyverse and fuzzyjoin:

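library(fuzzyjoin)
library(tidyverse)

# Long-format lookup table: one row per Licensure/Subject pair
lookup <- enframe(lic.subject_index, name = 'Licensure', value = 'Subject') %>%
       unnest(cols = Subject)

out <- df1 %>%
       rownames_to_column('rn') %>%               # remember the original row
       separate_rows(Subject, sep = ';') %>%      # one row per listed subject
       # fuzzy match on both columns (default max_dist = 2)
       stringdist_left_join(lookup, by = c('Licensure', 'Subject')) %>%
       group_by(rn = as.integer(rn)) %>%
       summarise(ind = any(!is.na(Licensure.y))) %>%   # TRUE if any subject matched
       ungroup %>%
       pull(ind) %>%
       mutate(df1, flag = .)
out$flag
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE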
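
To see why the join above accepts some subjects and rejects others, it can help to look at the edit distances involved. The sketch below uses base R's adist(), which computes Levenshtein distance; stringdist_left_join defaults to the "osa" method with max_dist = 2, so adist() is only a stand-in here, although the two should agree for these particular strings. The `pairs` data frame is made up for illustration.

```r
# Illustrative pairs: a subject as it appears after splitting on ';'
# and a candidate accepted subject from the lookup list.
pairs <- data.frame(
  subject  = c(" Math", "Spanish Language Arts", "Math"),
  accepted = c("Math", "Language Arts", "ELA"),
  stringsAsFactors = FALSE
)

# adist() returns a 1x1 matrix for two single strings; flatten it to an integer.
pairs$distance <- mapply(
  function(a, b) as.integer(adist(a, b)),
  pairs$subject, pairs$accepted,
  USE.NAMES = FALSE
)

# A pair counts as a match when its distance is within the threshold of 2.
pairs$within_2 <- pairs$distance <= 2
print(pairs)
```

The leading space left by splitting on ';' costs only one edit, so " Math" still matches "Math", while "Spanish Language Arts" is too far from "Language Arts" to count.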

Checking against the OP's ideal output:

as.logical(ideal.df$flag)
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
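
If exact matches are enough once the whitespace is trimmed, a base R version without fuzzyjoin might look like the sketch below. It is not part of the answer above: `flag_subjects`, `toy_df` and `toy_index` are hypothetical names, and the data is a cut-down copy of the question's so that the block runs on its own.

```r
# Hypothetical helper: TRUE when at least one listed subject is an
# accepted subject for the row's license (exact match after trimming).
flag_subjects <- function(subject, licensure, index) {
  # Split on semicolons and trim the surrounding whitespace
  subjects <- lapply(strsplit(as.character(subject), ";", fixed = TRUE), trimws)
  # Look up the accepted subjects for each license and test for any overlap;
  # an unknown license yields NULL, which gives FALSE
  mapply(function(s, lic) any(s %in% index[[lic]]),
         subjects, as.character(licensure), USE.NAMES = FALSE)
}

# Cut-down copy of the question's lookup list
toy_index <- list(
  "Core Subjects (Grades EC-6) 1770" = c("I teach all subjects for my students",
    "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
  "English Language Arts and Reading (Grades 7-12) 1709" = c("ELA",
    "English Language Arts", "Language Arts"),
  "Mathematics (Grades 7-12) 1706" = c("Math", "Mathematics")
)

# Cut-down copy of the question's data frame
toy_df <- data.frame(
  Subject = c("Math; Science",
              "Spanish Language Arts; I teach all subjects for my students",
              "Math", "Mathematics; ELA", "ELA"),
  Licensure = c("Core Subjects (Grades EC-6) 1770",
                "English Language Arts and Reading (Grades 7-12) 1709",
                "English Language Arts and Reading (Grades 7-12) 1709",
                "Mathematics (Grades 7-12) 1706",
                "Mathematics (Grades 7-12) 1706"),
  stringsAsFactors = FALSE
)

toy_df$flag <- flag_subjects(toy_df$Subject, toy_df$Licensure, toy_index)
print(toy_df$flag)
```

Unlike the fuzzy join, this treats a misspelled subject as a non-match, so it only suits data where the subject labels are consistent.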
