Fuzzy matching in Google Sheets

I'm trying to compare two columns in Google Sheets with this formula in column C:

=if(A1=B1,"","Mismatch")

It works fine, but I'm getting a lot of false positives:

A          B          C
MARY JO    Mary Jo
JAY, TIM   TIM JAY    Mismatch
Sam Ron    Sam Ron    Mismatch
Jack *Ma   Jack MA    Mismatch

Any ideas on how to handle this?

Implementing fuzzy matching with built-in Google Sheets formulas alone would be difficult. I would recommend a custom function for this, or a full-blown script if you want to populate all rows at once (both via Google Apps Script).

Custom Formula:

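function fuzzyMatch(string1, string2) {
  // Compare case-insensitively; String() guards against non-text cells.
  string1 = String(string1).toLowerCase();
  string2 = String(string2).toLowerCase();
  var n = -1;

  // Each character of string2 must appear in string1 after the previous match.
  for (var i = 0; i < string2.length; i++) {
    n = string1.indexOf(string2[i], n + 1);
    if (n === -1) return 'Mismatch';
  }
  return ''; // blank cell when every character was found in order
}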

This checks whether the characters of the 2nd string appear in the 1st string in the same order, ignoring case. See the sample data below for a case where it returns a mismatch.
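To make that behaviour concrete, here is a minimal sketch that runs the same in-order character check on rows from the question. It is plain JavaScript so it can be tried outside Sheets; `isSubsequence` is a hypothetical helper name used only for this illustration, not part of the answer's code.

```javascript
// Returns true when every character of `needle` appears in `haystack`
// in the same relative order (case-insensitive).
function isSubsequence(haystack, needle) {
  const h = String(haystack).toLowerCase();
  const n = String(needle).toLowerCase();
  let pos = -1;
  for (const ch of n) {
    pos = h.indexOf(ch, pos + 1);
    if (pos === -1) return false;
  }
  return true;
}

// Sample rows taken from the question (column A, column B).
const rows = [
  ['MARY JO', 'Mary Jo'],
  ['JAY, TIM', 'TIM JAY'],
  ['Jack *Ma', 'Jack MA'],
];
const results = rows.map(([a, b]) => (isSubsequence(a, b) ? '' : 'Mismatch'));
console.log(results); // [ '', 'Mismatch', '' ]
```

The second row still reports a mismatch: swapping the word order breaks the character order, so this check tolerates extra characters and case differences but not reordered names.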

Output:

[screenshot of sample output]

Note:

  • The last row is a mismatch because the 2nd string has an r in it that isn't found in the first string, so the required order is not met.
  • If this doesn't meet your test cases, add a more definitive list showing the expected output of the formula/function so it can be adjusted, or see player0's answer, which uses only Google Sheets formulas and is less strict with its conditions.
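For the "populate all rows at once" case, one option is a custom function that accepts whole ranges. The sketch below assumes Sheets passes each range argument as a 2D array of rows and that returning a 2D array fills the cells below; `fuzzyMatchRange` is a hypothetical name, and the check inside is the same in-order character test described above.

```javascript
/**
 * Hypothetical range-aware variant, called as
 *   =fuzzyMatchRange(A1:A4, B1:B4)
 * Each argument is expected to be a 2D array: one inner array per row.
 */
function fuzzyMatchRange(colA, colB) {
  return colA.map((row, r) => {
    const a = String(row[0]).toLowerCase();
    // Treat a missing row in colB as an empty string.
    const b = String(colB[r] ? colB[r][0] : '').toLowerCase();
    let pos = -1;
    for (const ch of b) {
      pos = a.indexOf(ch, pos + 1);
      if (pos === -1) return ['Mismatch'];
    }
    return ['']; // blank cell when all characters were found in order
  });
}
```

Entering the formula once in C1 would then fill the whole result column, instead of copying a per-row formula down.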

Try:

=ARRAYFORMULA(IFERROR(IF(LEN(
 REGEXREPLACE(REGEXREPLACE(LOWER(A1:A), "[^a-z ]", ), 
 LOWER("["&B1:B&"]"), ))>0, "mismatch", )))
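As I read it, the formula lowercases column A, strips everything except letters and spaces, then deletes every character that also occurs in column B; anything left over is flagged. The sketch below mirrors that idea in JavaScript. `charSetMismatch` is a hypothetical name, and it uses a Set instead of a regex character class, so it may differ from the formula on edge cases such as an empty B cell or B values containing `-` or `]`.

```javascript
// Reports 'mismatch' when `a` contains a letter or space that does not
// occur anywhere in `b` (case-insensitive). Character order and counts
// are ignored, which is why this check is more lenient.
function charSetMismatch(a, b) {
  const allowed = new Set(String(b).toLowerCase());
  const cleaned = String(a).toLowerCase().replace(/[^a-z ]/g, '');
  const leftover = Array.from(cleaned).filter(ch => !allowed.has(ch));
  return leftover.length > 0 ? 'mismatch' : '';
}

console.log(charSetMismatch('JAY, TIM', 'TIM JAY')); // ''
console.log(charSetMismatch('Sam Ron', 'Sam Rob'));  // 'mismatch'
```

Because only the set of characters matters, reordered names pass, but so would unrelated strings that happen to share the same letters.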

This uses a score-based approach to determine a match. You can decide what is or isn't a match based on that score:

Score Formula = getMatchScore(A1,B1)
Match Formula = if(C1<0.7,"mismatch",)
function getMatchScore(strA, strB, ignoreCase=true) {
  const toLowerCase = ignoreCase ? str => new String(str).toLowerCase() : str => str;
  const splitWords = str => str.split(/\b/);
  let [maxLenStr, minLenStr] = strA.length > strB.length ? [strA, strB] : [strB, strA]; 
  
  maxLenStr = toLowerCase(maxLenStr);
  minLenStr = toLowerCase(minLenStr);

  const maxLength = maxLenStr.length;
  const minLength = minLenStr.length;
  const lenScore = minLength / maxLength;

  const orderScore = Array.from(maxLenStr).reduce(
    (oldItem, nItem, index) => nItem === minLenStr[index] ? oldItem + 1 : oldItem, 0
  ) / maxLength;

  const maxKeyWords = splitWords(maxLenStr);
  const minKeyWords = splitWords(minLenStr);

  const keywordScore = minKeyWords.reduce(({ score, searchWord }, nItem) => {
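    // Escape regex metacharacters so tokens such as " (" or " *" are matched literally and cannot throw.
    const pattern = nItem.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const newSearchWord = searchWord.replace(new RegExp(pattern, ignoreCase ? 'i' : ''), '');
    score += searchWord.length != newSearchWord.length ? 1 : 0;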

    return { score, searchWord: newSearchWord };
  }, { score: 0, searchWord: maxLenStr }).score / minKeyWords.length;

  const sortedMaxLenStr = Array.from(maxKeyWords.sort().join(''));
  const sortedMinLenStr = Array.from(minKeyWords.sort().join(''));

  const charScore = sortedMaxLenStr.reduce((oldItem, nItem, index) => { 
    const surroundingChars = [sortedMinLenStr[index-1], sortedMinLenStr[index], sortedMinLenStr[index+1]]
    .filter(char => char != undefined);
    
    return surroundingChars.includes(nItem)? oldItem + 1 : oldItem
  }, 0) / maxLength;

  const score = (lenScore * .15) + (orderScore * .25) + (charScore * .25) + (keywordScore * .35);

  return score;
}
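To see how the four weights interact with the 0.7 cutoff, here is a small sketch of just the final weighting step. The weights and threshold are copied from the code above; `combineScores` and `classify` are hypothetical helper names, and the component values passed in are made-up illustrations rather than scores computed by getMatchScore.

```javascript
// Weights copied from the answer's final score line.
const WEIGHTS = { len: 0.15, order: 0.25, char: 0.25, keyword: 0.35 };

// Weighted sum of component scores, each expected to lie in [0, 1].
function combineScores(parts) {
  return Object.keys(WEIGHTS).reduce((sum, k) => sum + parts[k] * WEIGHTS[k], 0);
}

// Same rule as the Match Formula: below the threshold means mismatch.
function classify(score, threshold = 0.7) {
  return score < threshold ? 'mismatch' : '';
}

// Illustrative case: position-by-position order is completely off,
// but length, characters and keywords agree.
const reordered = combineScores({ len: 1, order: 0, char: 1, keyword: 1 });
console.log(reordered.toFixed(2), classify(reordered));
```

With these weights, a pair that fails the order component entirely can still reach roughly 0.75 and count as a match, whereas losing the keyword component (worth 0.35) on its own drops the score to about 0.65, below the cutoff.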
