Fuzzy matching in Google Sheets
I'm trying to compare two columns in Google Sheets with this formula in column C:
=if(A1=B1,"","Mismatch")
It works fine, but I'm getting a lot of false positives:
A | B | C |
---|---|---|
MARY JO | Mary Jo | |
JAY, TIM | TIM JAY | Mismatch |
Sam Ron | Sam Ron | Mismatch |
Jack *Ma | Jack MA | Mismatch |
Any ideas how to make this work?
Implementing fuzzy matching with a Google Sheets formula alone would be difficult. I would recommend a custom function for this, or a full-blown script if you want to populate all rows at once (both via Google Apps Script).
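function fuzzyMatch(string1, string2) {
  // Compare case-insensitively; String() also guards against numeric cell values.
  string1 = String(string1).toLowerCase();
  string2 = String(string2).toLowerCase();
  var n = -1;
  // Every character of string2 must appear in string1, in the same order.
  for (var i = 0; i < string2.length; i++) {
    n = string1.indexOf(string2[i], n + 1);
    if (n === -1) return 'Mismatch';
  }
  // Return an empty cell on a match, like the original formula does.
  return '';
}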
This checks whether the characters of the second string appear in the first string in the same order. In the sample data, the last row is a mismatch because the second string contains the letter "r", which is not found in the first string, so the required order is not met.
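To make the in-order check concrete, and to show one way to fill every row at once, here is a minimal sketch of the same logic as a custom function that takes a whole range. The names FUZZYMATCH_ROWS and inOrder are hypothetical and exist only for this illustration. It relies on the fact that a custom function receives a range as a 2D array and that a returned 2D array spills into the cells below the formula.

```javascript
// Hypothetical custom function: in Sheets it could be called as =FUZZYMATCH_ROWS(A1:B4).
// A range argument arrives as a 2D array (one inner array per row).
function FUZZYMATCH_ROWS(range) {
  return range.map(([a, b]) => [inOrder(a, b) ? '' : 'Mismatch']);
}

// True when every character of `needle` occurs in `haystack`
// in the same order (case-insensitive).
function inOrder(haystack, needle) {
  const h = String(haystack).toLowerCase();
  const n = String(needle).toLowerCase();
  let pos = -1;
  for (const ch of n) {
    pos = h.indexOf(ch, pos + 1);
    if (pos === -1) return false;
  }
  return true;
}

// Rows taken from the question.
const sample = [
  ['MARY JO', 'Mary Jo'],
  ['JAY, TIM', 'TIM JAY'],
  ['Jack *Ma', 'Jack MA'],
];
console.log(FUZZYMATCH_ROWS(sample)); // [ [ '' ], [ 'Mismatch' ], [ '' ] ]
```

Two things stand out. "JAY, TIM" against "TIM JAY" is still a mismatch, because the words appear in a different order. The check is also one-directional: inOrder('Jack MA', 'Jack *Ma') is false, since "*" does not occur in the first string. If a symmetric result is needed, it may be worth testing both directions.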
Another option is a score-based approach to determining a match. You can decide what is or isn't a match based on that score:
Score Formula = getMatchScore(A1,B1)
Match Formula = if(C1<0.7,"mismatch",)
function getMatchScore(strA, strB, ignoreCase = true) {
  // Coerce cell values to strings so that numbers do not break .length or .split().
  strA = String(strA);
  strB = String(strB);
  const toLowerCase = ignoreCase ? str => str.toLowerCase() : str => str;
  const splitWords = str => str.split(/\b/);
  // Escape regex metacharacters so that tokens such as "*" are matched literally.
  const escapeRegExp = str => str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  let [maxLenStr, minLenStr] = strA.length > strB.length ? [strA, strB] : [strB, strA];
  maxLenStr = toLowerCase(maxLenStr);
  minLenStr = toLowerCase(minLenStr);
  const maxLength = maxLenStr.length;
  const minLength = minLenStr.length;

  // Length score: ratio of the shorter length to the longer one.
  const lenScore = minLength / maxLength;

  // Order score: share of positions where both strings hold the same character.
  const orderScore = Array.from(maxLenStr).reduce(
    (count, char, index) => char === minLenStr[index] ? count + 1 : count, 0
  ) / maxLength;

  const maxKeyWords = splitWords(maxLenStr);
  const minKeyWords = splitWords(minLenStr);
  // Keyword score: share of the shorter string's tokens found in the longer string.
  // Each hit is removed from the search text so it cannot be counted twice.
  const keywordScore = minKeyWords.reduce(({ score, searchWord }, word) => {
    const newSearchWord = searchWord.replace(new RegExp(escapeRegExp(word), ignoreCase ? 'i' : ''), '');
    score += searchWord.length !== newSearchWord.length ? 1 : 0;
    return { score, searchWord: newSearchWord };
  }, { score: 0, searchWord: maxLenStr }).score / minKeyWords.length;

  // Character score: after sorting the tokens, count characters that match within one position.
  const sortedMaxLenStr = Array.from(maxKeyWords.sort().join(''));
  const sortedMinLenStr = Array.from(minKeyWords.sort().join(''));
  const charScore = sortedMaxLenStr.reduce((count, char, index) => {
    const surroundingChars = [sortedMinLenStr[index - 1], sortedMinLenStr[index], sortedMinLenStr[index + 1]]
      .filter(c => c !== undefined);
    return surroundingChars.includes(char) ? count + 1 : count;
  }, 0) / maxLength;

  // Weighted sum of the four component scores.
  return (lenScore * 0.15) + (orderScore * 0.25) + (charScore * 0.25) + (keywordScore * 0.35);
}
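To get a feel for how the weighted components behave, the sketch below recomputes just two of the four (the length ratio and the position-by-position comparison) for pairs from the question, already lowercased. The helper names lengthScore and positionScore are hypothetical, and both assume at least one input is non-empty. The keyword and character components are left out, so this is not the full score.

```javascript
// Hypothetical helpers mirroring the lenScore and orderScore steps of getMatchScore.
// Both assume at least one of the two strings is non-empty.
function lengthScore(a, b) {
  return Math.min(a.length, b.length) / Math.max(a.length, b.length);
}

function positionScore(a, b) {
  const [longer, shorter] = a.length > b.length ? [a, b] : [b, a];
  let same = 0;
  for (let i = 0; i < longer.length; i++) {
    // Count positions where both strings hold the same character.
    if (longer[i] === shorter[i]) same++;
  }
  return same / longer.length;
}

console.log(lengthScore('jay, tim', 'tim jay'));    // 0.875
console.log(positionScore('jay, tim', 'tim jay'));  // 0
console.log(positionScore('jack *ma', 'jack ma'));  // 0.625
```

For the reordered name the position-by-position component is 0, so with the weights above those two components contribute only about 0.13 to the total. Any remaining credit for that pair has to come from the keyword and character components, which do not depend on exact positions in the same way.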