Comparing two columns of two dataframes based on partial string match

I have two sample data frames, df1 and df2, given below. df1 lists selected tennis match fixtures, with the player names (player1_name, player2_name) and the date each match was played. Full player names are used here.

df2 lists all tennis match results (winner, loser) for each date. Here, names are given as the initial of the first name followed by the full last name. The player names for fixtures and for results were scraped from different websites, so in some cases the last names may not match exactly. Taking this into account, I would like to add a new column to df1 that says whether player1 or player2 won. Basically, I want to map player1_name and player2_name from df1 to winner and loser from df2 by some form of partial matching, restricted to the same date.

# Output of dput(df1), assigned so the example can be run directly
df1 <-
structure(list(date = structure(c(18534, 18534, 18534, 18534, 
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere", 
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen", 
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson", 
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas", 
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")
# Output of dput(df2), assigned so the example can be run directly
df2 <-
structure(list(date = structure(c(18534, 18534, 18534, 18534, 
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"), 
    winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin", 
    "A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis", 
    "M Giron", "A Rublev", "N Djokovic", "R Carballes Baena", 
    "A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov", 
    "G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin", 
    "V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot", 
    "G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey", 
    "M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili", 
    "J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
    )), row.names = c(NA, -20L), class = "data.frame")

I have created a custom function that matches each string to its closest match in a string vector, using the RecordLinkage package. I could probably write some very inefficient code around this function, but before going there I would like to see whether it can be done more efficiently.

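library(RecordLinkage)  # provides levenshteinSim()

# For each element of `string`, return the most similar element of
# `stringVector`, or NA if the best similarity is below `max_threshold`.
ClosestMatch <- function(string, stringVector, max_threshold = 0.5) {
        matches <- character(length(string))
        for (i in seq_along(string)) {
                similarity <- levenshteinSim(string[i], stringVector)
                if (max(similarity) >= max_threshold) {
                        matches[i] <- stringVector[which.max(similarity)]
                } else {
                        matches[i] <- NA_character_
                }
        }
        return(matches)
}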
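
Since efficiency is the concern, the per-string loop can in principle be replaced by a single all-pairs distance matrix. The following is a minimal sketch in base R, not the RecordLinkage implementation: it assumes that levenshteinSim is one minus the edit distance divided by the length of the longer string, and it swaps in utils::adist for the distance. The names closest_match_base and min_sim are made up for illustration, and "Zzz Qqq" is a dummy input with no plausible match.

```r
# Base-R sketch; closest_match_base and min_sim are illustrative names.
closest_match_base <- function(strings, candidates, min_sim = 0.5) {
  # Edit distances for all pairs: one row per input, one column per candidate
  d <- utils::adist(strings, candidates)
  # Length of the longer string in each pair, used to scale distances to [0, 1]
  longest <- outer(nchar(strings), nchar(candidates), pmax)
  sim <- 1 - d / longest
  # Index of the most similar candidate in each row, and its similarity
  best <- max.col(sim, ties.method = "first")
  best_sim <- sim[cbind(seq_along(strings), best)]
  # Keep the best candidate only if it clears the threshold
  ifelse(best_sim >= min_sim, candidates[best], NA_character_)
}

candidates <- c("K Anderson", "R Carballes Baena", "A Rublev")
# "Zzz Qqq" is a dummy name that should not match anything
closest <- closest_match_base(c("Kevin Anderson", "Zzz Qqq"), candidates)
print(closest)
```

Comparing a full name against an abbreviated one costs similarity (here "Kevin Anderson" against "K Anderson"), so the threshold may need tuning, or the full names can be abbreviated before matching.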

I gave it a go using stringdist:

library(stringdist)

# Pre-allocate the result columns
df1$winner <- NA_character_
df1$loser <- NA_character_

for (i in seq_len(nrow(df1))) {

  # Combine the names of player1 and player2 and compute the distance
  # to every winner/loser combination in df2.
  # I like the cosine method because it returns a value between 0 and 1
  # rather than an integer edit count.
  d <- stringdist(
    paste(df1$player1_name[i], df1$player2_name[i]),
    paste(df2$winner, df2$loser),
    method = "cosine"
  )

  # Take winner and loser from the row of df2 with the closest match
  # (i.e. the lowest stringdist); which.min() returns one index even on ties
  best <- which.min(d)
  df1$winner[i] <- df2$winner[best]
  df1$loser[i] <- df2$loser[best]
}

# A second loop replaces the abbreviated names in the winner/loser columns
# with their closest match among player1_name and player2_name

for (i in seq_len(nrow(df1))) {
  n <- stringdist(df1$winner[i], c(df1$player1_name[i], df1$player2_name[i]), method = "cosine")
  if (n[1] > n[2]) {
    df1$winner[i] <- df1$player2_name[i]
    df1$loser[i] <- df1$player1_name[i]
  } else if (n[1] < n[2]) {
    df1$winner[i] <- df1$player1_name[i]
    df1$loser[i] <- df1$player2_name[i]
  }
}

> df1
        date    player1_name         player2_name            winner                loser
1 2020-09-29     Laslo Djere       Kevin Anderson    Kevin Anderson          Laslo Djere
2 2020-09-29    Hugo Dellien    Ricardas Berankis Ricardas Berankis         Hugo Dellien
3 2020-09-29   Quentin Halys         Marcos Giron      Marcos Giron        Quentin Halys
4 2020-09-29   Steve Johnson    Roberto Carballes Roberto Carballes        Steve Johnson
5 2020-09-29 Henri Laaksonen         Pablo Cuevas      Pablo Cuevas      Henri Laaksonen
6 2020-09-29 Thiago Monteiro Nikoloz Basilashvili   Thiago Monteiro Nikoloz Basilashvili
7 2020-09-29   Andrej Martin           Joao Sousa     Andrej Martin           Joao Sousa
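
The loops above compare each fixture against every row of df2 and never use the date, and they always pick some row even when nothing is close. The question also asked for a column saying whether player1 or player2 won. A minimal sketch of one way to cover those points follows. It is an assumption-laden variant, not the code above: it uses base R's utils::adist instead of stringdist, abbreviates full names to the "initial plus last name" form before comparing, and leaves NA when no same-day result is close enough. The helper names (abbreviate_name, norm_dist, match_result), the max_dist cut-off of 0.35, and the extra 2020-09-30 fixture are all made up for illustration.

```r
# Sketch only: abbreviate_name, norm_dist, match_result and max_dist are
# illustrative names, and the 2020-09-30 fixture below is made up.

# "Laslo Djere" -> "L Djere": keep the first letter of the first word
abbreviate_name <- function(full_name) {
  sub("^(\\S)\\S*\\s+", "\\1 ", full_name, perl = TRUE)
}

# Case-insensitive edit distance from one string `a` to each element of `b`,
# divided by the length of the longer string in each pair
norm_dist <- function(a, b) {
  d <- as.vector(utils::adist(a, b, ignore.case = TRUE))
  d / pmax(nchar(a), nchar(b))
}

match_result <- function(fixtures, results, max_dist = 0.35) {
  outcome <- rep(NA_character_, nrow(fixtures))
  for (i in seq_len(nrow(fixtures))) {
    # Only consider results played on the same date as the fixture
    same_day <- results[results$date == fixtures$date[i], , drop = FALSE]
    if (nrow(same_day) == 0) next
    p1 <- abbreviate_name(fixtures$player1_name[i])
    p2 <- abbreviate_name(fixtures$player2_name[i])
    # Summed distance for both orientations against every same-day result
    d_p1_won <- norm_dist(p1, same_day$winner) + norm_dist(p2, same_day$loser)
    d_p2_won <- norm_dist(p2, same_day$winner) + norm_dist(p1, same_day$loser)
    # Accept the best orientation only if its average distance is small
    if (min(d_p1_won, d_p2_won) / 2 <= max_dist) {
      outcome[i] <- if (min(d_p1_won) <= min(d_p2_won)) "player1" else "player2"
    }
  }
  outcome
}

fixtures <- data.frame(
  date = as.Date(c("2020-09-29", "2020-09-29", "2020-09-29", "2020-09-30")),
  player1_name = c("Laslo Djere", "Steve Johnson", "Thiago Monteiro", "Andrej Martin"),
  player2_name = c("Kevin Anderson", "Roberto Carballes", "Nikoloz Basilashvili", "Joao Sousa"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  date = as.Date(rep("2020-09-29", 4)),
  winner = c("K Anderson", "R Carballes Baena", "T Monteiro", "A Rublev"),
  loser = c("L Djere", "S Johnson", "N Basilashvili", "S Querrey"),
  stringsAsFactors = FALSE
)

fixtures$result <- match_result(fixtures, results)
print(fixtures)
```

Scoring both orientations of the pair together, rather than each name separately, makes a wrong match less likely when two players on the same day share a similar last name. The cut-off is a guess and would need checking against real data.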
