
String comparison using BigQuery

I have a table of people and their hometown names, but the same city is written in several different ways:

Name   Hometown
João   São Paulo
Maria  Sao Paul
Pedro  São Paulo.
Maria  S. Paulo

I need to process this data so that it is standardized like this:

Name   Hometown
João   São Paulo
Maria  São Paulo
Pedro  São Paulo
Maria  São Paulo

  • The dataset has more than 2400 distinct values, so I can't hard-code the mapping.
  • I have a Country dimension table with all cities and their correct names.

I tried this Stack Overflow approach, which looked like exactly what I need, but it does not work with my entire dataset.

Consider the approach below (assuming you have a lookup table with all the proper city names). For the purpose of the example, I have it as a CTE with just a few entries:

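-- Lookup table of canonical city names (a small CTE here for illustration).
with cities as (
  select 'São Paulo' as city union all
  select 'Los Angeles' union all
  select 'Dnipro' union all
  select 'Kyiv'
)
-- Match each hometown to the canonical city with the same Soundex code.
-- Rows with no match get a NULL Hometown because of the left join.
select Name, city as Hometown
from your_table
left join cities
on soundex(Hometown) = soundex(city)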

If applied to the sample data in your question, the output is:

[image: query output]

Note: you obviously need to take care of potential duplication in case some cities sound similar. In that case, adding a country constraint to the join might help.

First, the basics.

  1. Strip non-letters.
  2. Case fold.
  3. Convert to ASCII equivalents.

The first one is straightforward: strip out everything that isn't a letter, so "São Paulo" and "São Paulo." both become "SãoPaulo".

Case folding is also straightforward: change everything to lower or upper case, so "são paulo" and "São Paulo" compare the same.

Finally, convert the characters to their plain ASCII equivalents. For example, "são" becomes "sao".

With this normalization done, the issues of spaces, extra characters, accents, and case are taken care of. I would recommend doing this outside of BigQuery, in a language like Python. Do a select distinct, then transform and compare each value using libraries such as unidecode.
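
The three steps above can be sketched in Python as follows. This is only a minimal illustration: the function name normalize is made up for the example, and it uses the standard library's unicodedata module as a stand-in for unidecode, so characters without a simple decomposition (for example "ß" or "ø") may be handled differently than unidecode would handle them.

```python
import unicodedata


def normalize(name: str) -> str:
    """Hypothetical helper: reduce a city name to a comparable ASCII key."""
    # 1. Strip non-letters (spaces, dots, digits, punctuation).
    letters = "".join(ch for ch in name if ch.isalpha())
    # 2. Case fold so that upper and lower case compare the same.
    folded = letters.casefold()
    # 3. Convert to ASCII equivalents: decompose accented letters into
    #    base letter + combining mark, then keep only the ASCII letters.
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())


if __name__ == "__main__":
    for raw in ["São Paulo", "Sao Paul", "São Paulo.", "S. Paulo"]:
        print(repr(raw), "->", normalize(raw))
```

With this key, "São Paulo" and "São Paulo." collapse to the same value, while "Sao Paul" and "S. Paulo" are still slightly different and need the fuzzy matching described next.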


You can then employ some heuristics to try to find "close enough" matches. One example is the Levenshtein distance, which is the number of substitutions, insertions, and deletions needed to turn one string into another. Python has a Levenshtein library.

For example, "Sao Paul" and "Sao Paulo" have a Levenshtein distance of one: add one letter. "S Paulo" and "Sao Paulo" have a Levenshtein distance of two: add two letters. "Sao Paulo" and "Saint Paul" have a Levenshtein distance of four: change o to i, add n and t, remove o.
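
To make the distances above concrete, here is a small pure-Python sketch of the Levenshtein distance using the usual dynamic-programming table. It is written out by hand only so the example has no dependencies; in practice a Levenshtein library would likely be faster. The helper closest_city and its max_distance threshold are assumptions made for illustration, not part of any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


def closest_city(value, cities, max_distance=2):
    """Hypothetical helper: return the city nearest to value,
    or None if even the nearest one is more than max_distance edits away."""
    best = min(cities, key=lambda city: levenshtein(value, city))
    return best if levenshtein(value, best) <= max_distance else None


if __name__ == "__main__":
    print(levenshtein("Sao Paul", "Sao Paulo"))
    print(levenshtein("S Paulo", "Sao Paulo"))
    print(levenshtein("Sao Paulo", "Saint Paul"))
    print(closest_city("S Paulo", ["Sao Paulo", "Los Angeles", "Dnipro", "Kyiv"]))
```

The threshold is a judgment call: too small and misspellings are missed, too large and distinct cities such as "Sao Paulo" and "Saint Paul" start to collide.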

Again, I'd recommend doing this in a regular programming language and then writing the normalized results back to BigQuery.
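
As a rough sketch of that workflow, the snippet below builds a raw-to-canonical mapping from the distinct hometown values, which could then be loaded back into BigQuery as a lookup table. It assumes the distinct values and the canonical city list have already been fetched into Python lists. The function name build_mapping, its key parameter, and the 0.8 cutoff are all hypothetical, and it uses difflib.get_close_matches from the standard library, which scores similarity with a ratio rather than with the Levenshtein distance.

```python
import difflib


def build_mapping(raw_values, canonical, key=str.casefold, cutoff=0.8):
    """Hypothetical helper: map each raw value to its closest canonical
    name, or to None when nothing is similar enough."""
    # Index canonical names by their comparison key.
    by_key = {key(city): city for city in canonical}
    mapping = {}
    for raw in raw_values:
        # get_close_matches returns at most n candidates whose similarity
        # ratio is at least cutoff, best match first.
        hits = difflib.get_close_matches(key(raw), list(by_key), n=1, cutoff=cutoff)
        mapping[raw] = by_key[hits[0]] if hits else None
    return mapping


if __name__ == "__main__":
    cities = ["São Paulo", "Los Angeles", "Dnipro", "Kyiv"]
    raw = ["Sao Paul", "São Paulo.", "S. Paulo", "Tokyo"]
    for value, city in build_mapping(raw, cities).items():
        print(repr(value), "->", city)
```

Values that map to None are the ones worth reviewing by hand; passing a stricter normalization function as key should make the automatic matches more reliable.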
