String comparison using BigQuery
I have a table of people and their hometown names, but the same city is written in different ways:
Name | Hometown |
---|---|
João | São Paulo |
Maria | Sao Paul |
Pedro | São Paulo. |
Maria | S. Paulo |
I need to process this to standardize the data so that it looks like this:
Name | Hometown |
---|---|
João | São Paulo |
Maria | São Paulo |
Pedro | São Paulo |
Maria | São Paulo |
I tried the approach from this Stack Overflow post; it would be exactly what I need, but it does not work with my entire dataset.
Consider the approach below. It assumes you have a lookup table with all the proper city names; for the purpose of this example, I have it as a CTE with just a few:
with cities as (
select 'São Paulo' as city union all
select 'Los Angeles' union all
select 'Dnipro' union all
select 'Kyiv'
)
select Name, City as Hometown
from your_table
left join cities
on soundex(Hometown) = soundex(city)
If applied to the sample data in your question, the output is the table you are after, with every Hometown mapped to São Paulo.
Note: you obviously need to take care of potential duplicates in case some cities sound similar; adding a country constraint might help in that case.
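To see why the join on soundex lines up the four spellings, here is a minimal Python sketch of the classic American Soundex rules. It is an illustration only, not BigQuery's SOUNDEX implementation, which may differ in edge cases; the accent-stripping step and the function name are assumptions made for this example.

```python
import unicodedata

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(text: str) -> str:
    """Simplified Soundex: first letter plus up to three digits."""
    # Assumption: split accents off their base letters, then keep only A-Z.
    decomposed = unicodedata.normalize("NFKD", text).upper()
    letters = [c for c in decomposed if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit unless it repeats the previous letter's code.
        if code and code != prev:
            result += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]


for raw in ["São Paulo", "Sao Paul", "São Paulo.", "S. Paulo"]:
    print(raw, soundex(raw))  # each spelling yields S140
```

Because only the first letter and the first few consonant classes survive, unrelated city names can also collide, which is the duplication risk mentioned in the note.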
First, the basics: strip out non-letter characters, fold case, and convert accented characters to their ASCII equivalents.
The first one is straightforward: strip out everything that isn't a letter, so São Paulo and São Paulo. both become SãoPaulo.
Case folding is also straightforward: change everything to lower or upper case, so são paulo and São Paulo compare the same.
Finally, convert accented characters to their plain ASCII equivalents. For example, são becomes sao.
With this normalization done, the issues of spaces, extra characters, accents, and case are taken care of. I would recommend doing this outside of BigQuery, in a language like Python. Do a select distinct, then transform and compare each value using libraries such as unidecode.
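The three normalization steps could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: it uses the standard-library unicodedata module in place of unidecode so that it runs without extra installs, and the function name normalize_city is made up for this example.

```python
import unicodedata


def normalize_city(name: str) -> str:
    """Strip non-letters, fold case, then reduce accented letters to ASCII."""
    # Step 1: keep only letters (drops spaces, dots, digits, punctuation).
    letters = "".join(ch for ch in name if ch.isalpha())
    # Step 2: case folding, so "São" and "são" compare the same.
    folded = letters.casefold()
    # Step 3: split accents off their base letters and drop what is not ASCII.
    # Assumption: letters with no decomposition (for example "ø") are simply
    # dropped here; unidecode transliterates more of them.
    decomposed = unicodedata.normalize("NFKD", folded)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize_city("São Paulo"))   # saopaulo
print(normalize_city("São Paulo."))  # saopaulo
print(normalize_city("são paulo"))   # saopaulo
print(normalize_city("S. Paulo"))    # spaulo
```

Values such as Sao Paul and S. Paulo still differ from saopaulo after this step, so normalization alone only resolves the punctuation, case, and accent variants.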
You can then employ some heuristics to try to find "close enough" matches. One example is the Levenshtein distance, which is the number of substitutions, insertions, and deletions needed to turn one string into another. Python has a Levenshtein library.
For example, Sao Paul and Sao Paulo have a Levenshtein distance of one: add one letter. S Paulo and Sao Paulo have a Levenshtein distance of two: add two letters. Sao Paulo and Saint Paul have a Levenshtein distance of four: change o to i, add n and t, remove o.
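The distances above can be checked with a small pure-Python implementation of the standard dynamic-programming algorithm. This is a sketch for illustration rather than the Levenshtein library itself; closest_city and its max_distance threshold are hypothetical names and values chosen for this example, and the threshold would need tuning on real data.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Count the substitutions, insertions, and deletions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_city(value: str, canonical: List[str],
                 max_distance: int = 2) -> Optional[str]:
    """Return the nearest canonical name, or None if nothing is close enough."""
    best = min(canonical, key=lambda city: levenshtein(value, city))
    return best if levenshtein(value, best) <= max_distance else None


cities = ["Sao Paulo", "Los Angeles", "Dnipro", "Kyiv"]
print(levenshtein("Sao Paul", "Sao Paulo"))    # 1
print(levenshtein("S Paulo", "Sao Paulo"))     # 2
print(levenshtein("Sao Paulo", "Saint Paul"))  # 4
print(closest_city("S Paulo", cities))         # Sao Paulo
print(closest_city("Saint Paul", cities))      # None
```

Returning None for values beyond the threshold leaves them for manual review instead of forcing a wrong match such as Saint Paul to Sao Paulo.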
Again, I'd recommend doing this with a regular programming language and then writing the normalized results back to BigQuery.