
Efficient way of keyword matching in R?

I am trying to match keywords between two large bibliographic datasets (1.8M obs and 3.9M obs), which are derived from various fields in the record: title, author, publication date, publisher.

For each entry (1.8M), I want to match each keyword in the string against keywords in each entry of the other dataset (3.9M), and return the line with the most matches.

The method I've come up with, using the separate() and gather() functions from tidyverse, along with some basic dplyr, seems to work, but it is impossible to scale to the entire dataset.

Is there a more efficient (or entirely better) way of doing this?

Sample data (three keyword strings per dataset) and code:

library(tidyverse)  # loads dplyr, tidyr, and stringr, all used below

df1 <- data.frame("df1.index" = c(1:3), 
                  "keywords" = c("2013 history interpretation oxford the tractatus univ wittgensteins", 
                                 "2014 baxi law of oxford pratiksha public secrets univ", 
                                 "2014 darwin flinching from looking on oxford scientific shell-shock"))


df2 <- data.frame("df2.index" = c(1:3), 
                  "keywords" = c("2014 darwin flinching from looking on oxford scientific theatricality",
                                 "2013 interpretation oxford tractatushistory univ wittgensteins", 
                                 "2014 baxi in india law of oxford pratiksha public rape secrets trials univ"))

#separate up to 10 keywords; fill = "right" pads shorter rows with NA,
#extra = "drop" silently discards tokens beyond the 10th
keys <- paste0("key", 1:10)
df1_sep <- separate(df1, keywords, into = keys, sep = " ",
                    remove = FALSE, fill = "right", extra = "drop")
df2_sep <- separate(df2, keywords, into = keys, sep = " ",
                    remove = FALSE, fill = "right", extra = "drop")

#gather separated keywords into one column
df1_gather <- df1_sep %>% 
  gather(keys, key.match, key1:key10, factor_key = TRUE) %>% 
  distinct()
df2_gather <- df2_sep %>% 
  gather(keys, key.match, key1:key10, factor_key = TRUE) %>% 
  distinct()

#remove NAs and blanks, trim whitespace
df1_gather <- df1_gather %>%
  filter(!is.na(key.match), key.match != "") %>%
  mutate(key.match = str_trim(key.match))

df2_gather <- df2_gather %>%
  filter(!is.na(key.match), key.match != "") %>%
  mutate(key.match = str_trim(key.match))

#join, after removing some columns from df2_gather
df2_gather <- df2_gather %>% select(df2.index, key.match)

df_join <- left_join(df1_gather, df2_gather, by = "key.match")

#remove NAs
df_join <- df_join %>% filter(!is.na(df2.index))

#tally matches for each index, then take top match
df_join <- df_join %>% group_by(df1.index, df2.index) %>% tally()
df_join <- df_join %>% group_by(df1.index) %>% top_n(1, n)

#add back keywords to review match 
df_join$df1.keywords=df1$keywords[match(df_join$df1.index, df1$df1.index)]
df_join$df2.keywords=df2$keywords[match(df_join$df2.index, df2$df2.index)] 
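At 1.8M × 3.9M rows, the wide reshape above is the main bottleneck. Below is a minimal sketch of the same join-and-count logic with data.table (an assumption; data.table is not used in the question), tokenising with strsplit() so there is no fixed 10-keyword limit; the tokenise() helper is hypothetical, not part of the original code:

```r
library(data.table)

# Same sample data as above
df1 <- data.frame(df1.index = 1:3, keywords = c(
  "2013 history interpretation oxford the tractatus univ wittgensteins",
  "2014 baxi law of oxford pratiksha public secrets univ",
  "2014 darwin flinching from looking on oxford scientific shell-shock"))
df2 <- data.frame(df2.index = 1:3, keywords = c(
  "2014 darwin flinching from looking on oxford scientific theatricality",
  "2013 interpretation oxford tractatushistory univ wittgensteins",
  "2014 baxi in india law of oxford pratiksha public rape secrets trials univ"))

# Hypothetical helper: one row per (index, token)
tokenise <- function(df, index_col) {
  as.data.table(df)[
    , .(word = unlist(strsplit(as.character(keywords), " ", fixed = TRUE))),
    by = index_col]
}
t1 <- tokenise(df1, "df1.index")
t2 <- tokenise(df2, "df2.index")

# Join on token, tally shared tokens per index pair, keep the best df2 match
matches <- merge(t1, t2, by = "word", allow.cartesian = TRUE)[
  , .N, by = .(df1.index, df2.index)][
  order(-N), .SD[1], by = df1.index]
```

On the sample data this returns df2.index 2, 3 and 1 for df1 rows 1–3 (5, 9 and 8 shared tokens); common stopword-like tokens ("of", "the", "2014") inflate the counts, so filtering them out first should sharpen the matches at full scale.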

Maybe this approach, which counts each keyword directly, could be useful. I hope this helps:

library(dplyr)    # for left_join()
library(tidytext) # for unnest_tokens()
#Separate
df1 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l1
df2 %>% mutate(keywords=as.character(keywords)) %>% unnest_tokens(word,keywords) -> l2
#Join on the tokenised keyword
l1 %>% left_join(l2, by = "word") -> l3
l2 %>% left_join(l1, by = "word") -> l4
#Compute number of occurrences
table(l3$df1.index,l3$df2.index,exclude=NULL)
table(l4$df1.index,l4$df2.index,exclude=NULL)

Output:

    1 2 3 <NA>
  1 1 5 2    3
  2 2 2 9    0
  3 8 1 2    2

       1 2 3
  1    1 5 2
  2    2 2 9
  3    8 1 2
  <NA> 1 1 4
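The tables above give the counts per index pair; to return only the best match per df1.index, as the question asks, the same token-level join can be counted and sliced. A self-contained sketch with dplyr and tidyr, using separate_rows() in place of unnest_tokens() to keep the dependency list small (an assumption, not part of the original answer; slice_max() breaks ties arbitrarily):

```r
library(dplyr)
library(tidyr)

# Same sample data as above
df1 <- data.frame(df1.index = 1:3, keywords = c(
  "2013 history interpretation oxford the tractatus univ wittgensteins",
  "2014 baxi law of oxford pratiksha public secrets univ",
  "2014 darwin flinching from looking on oxford scientific shell-shock"))
df2 <- data.frame(df2.index = 1:3, keywords = c(
  "2014 darwin flinching from looking on oxford scientific theatricality",
  "2013 interpretation oxford tractatushistory univ wittgensteins",
  "2014 baxi in india law of oxford pratiksha public rape secrets trials univ"))

# One row per (index, token)
l1 <- df1 %>% mutate(keywords = as.character(keywords)) %>%
  separate_rows(keywords, sep = " ") %>% rename(word = keywords)
l2 <- df2 %>% mutate(keywords = as.character(keywords)) %>%
  separate_rows(keywords, sep = " ") %>% rename(word = keywords)

# Count shared tokens per index pair, keep the top df2 match for each df1 row
best <- l1 %>%
  inner_join(l2, by = "word") %>%
  count(df1.index, df2.index) %>%
  group_by(df1.index) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The inner join keeps only shared tokens, so no NA filtering is needed, and the result has one row per df1.index with its best df2.index and the match count n.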
