Parsing Street Address Using RegEx

I know there are many questions on this topic. I am trying to parse and fetch street addresses from HTML pages. The format of these pages does not follow any pattern. Can someone help me come up with a regex that would match a street address, irrespective of the number of tags between them? Are there any ways to do this other than using regular expressions?

Before you get all traditional, let me share my experience. I've parsed over 1 million web pages this way in Java. When I need small pieces out of a page, a regex is perfect when paired with a replace to strip tags. In fact, it is more efficient and faster, especially when using Java's replaceAll() function to strip tags. Build a fork-join pool of both and test some parsing; you won't believe your eyes. I've added that part at the end. This is not the full regex but a starting point, since it would take some trial and error to build. I believe the problem statement was: a bunch of pages with no clear route to the address.

So, yes, there are ways. What follows is a bit of an introduction to thinking about this in regex.

Words and groups of words always follow a pattern; otherwise they wouldn't be readable. Still, there are several things to note. Addresses can vary greatly, so it is important to keep building out your regex. Next, if you have access to a CAS engine, use it for anything you get. It standardizes your address.

As a must, have you tried XML? It will narrow everything and can help get rid of tags before you format. You need to narrow everything. If you are using Java or Python, run this step in a ForkJoinPool (Java) or a multiprocessing.Pool (Python).

Your process should be:

  1. Narrow if possible
  2. Execute a regex that exploits formatting
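As a rough illustration of the two-step process above, the sketch below narrows each page by stripping tags, then runs the starting regex given later in this answer, with one task per page in a ForkJoinPool. The class name, method names, and sample pages are hypothetical, and keeping only the first match per page is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParallelExtract {
    // Tag-stripping regex and starting address regex, both taken from this answer.
    static final Pattern TAG = Pattern.compile("<.*?>");
    static final Pattern CANDIDATE = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Step 1: narrow (drop tags). Step 2: run the regex and keep the first match.
    static String firstCandidate(String html) {
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = CANDIDATE.matcher(text);
        return m.find() ? m.group().trim() : "";
    }

    // Runs firstCandidate for every page in a ForkJoinPool; results keep the input order.
    static List<String> extractAll(List<String> pages) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String page : pages) {
                tasks.add(() -> firstCandidate(page));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("extraction failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample pages.
        System.out.println(extractAll(List.of("<p>123 Main St</p>", "<div>No address here</div>")));
    }
}
```

Compiled Pattern objects are safe to share between threads, while each Matcher is created inside its own task.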

Lastly, here is a regex cheat sheet.

Keep in mind that I don't know what websites you are using or their formats. I have personally had to pull this data with a different regex per site, but that was for odd formats and other issues on websites that run like databases of a certain variety.

That said, an address has a format: numbers, then a street address and apartment number made of pretty much anything, then city, state, and zip code. Basically it is \\d+ followed by any combination of letters and numbers.

So, to start you off (in Java, with double backslashes):

[\\d]+[A-Za-z0-9\\s,\\.]+
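A minimal sketch of how this starting pattern might be used with java.util.regex. The class name and the sample sentence are hypothetical. On this sample the match runs past the zip code into the following word, which is why the narrowing steps matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressCandidates {
    // The starting pattern from the text: digits, then letters, digits, whitespace, commas, periods.
    static final Pattern START = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Returns every non-overlapping match, trimmed of surrounding whitespace.
    static List<String> find(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = START.matcher(text);
        while (m.find()) {
            out.add(m.group().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input; the match continues through " today" and stops at "!".
        System.out.println(find("Visit us at 123 Main St, Springfield, IL 62704 today!"));
    }
}
```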

If you are not using XML and want to narrow your search by starting after one marker and stopping before another, excluding the markers themselves, use the following (start and end stand for the surrounding text):

(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=end)

HTML pages always seem to have tags, so that would be something like:

(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<) 
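The tag-delimited pattern could be exercised roughly as follows. The class name and sample markup are hypothetical. The lookbehind and lookahead keep the angle brackets out of the match, so only the text sitting directly between two tags is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenTags {
    // Preceded by '>', starts with digits, lazily extends until the next '<'.
    static final Pattern BETWEEN = Pattern.compile("(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)");

    static List<String> find(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = BETWEEN.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical markup: only the span content starts with a digit right after a tag.
        System.out.println(find("<p>Call now</p><span>456 Oak Ave, Dayton, OH 45402</span>"));
    }
}
```

Text that does not begin with a digit immediately after a tag, such as "Call now" here, is skipped.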

You may be able to use a zip code as your ending place if there is a multi-part zip code.

[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+
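A small sketch of the zip-terminated pattern, with a hypothetical class name and sample strings. One caveat worth seeing in practice: because the middle part is lazy, the match ends at the first run of digits or hyphens after the street number, so an apartment number can cut it short.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZipTerminated {
    // Digits, a lazy run of address characters, then a closing run of digits and hyphens.
    static final Pattern ZIP_END = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+");

    // Returns the first match, if any.
    static Optional<String> find(String text) {
        Matcher m = ZIP_END.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Ends at the multi-part zip code.
        System.out.println(find("Ship to 123 Main St, Springfield, IL 62704-1234 today"));
        // Ends early, at the apartment number.
        System.out.println(find("123 Main St Apt 4, Springfield, IL 62704"));
    }
}
```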

As a final note, you can chain regexes together with a pipe delimiter, e.g.:

(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+|(?<=start)[A-Za-z0-9\\s,\\.]+?(?=end)
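To make the chaining concrete, here is a sketch that joins two of the earlier patterns with a pipe. The particular combination (zip-terminated first, tag-delimited as the fallback) is an assumption for illustration rather than the exact expression above, and the class name and sample markup are hypothetical. At each position the left alternative is tried first, and the right one is used only if the left fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedPatterns {
    // Left alternative: address ending in a zip-like run. Right alternative: text between tags.
    static final Pattern CHAINED = Pattern.compile(
        "[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+"
        + "|"
        + "(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)");

    static List<String> find(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = CHAINED.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The first item ends in a zip code; the second has none and is caught by the fallback.
        System.out.println(find("<li>789 Pine Rd, Austin, TX 78701-4321</li><li>12 Elm Court</li>"));
    }
}
```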

If this is not narrow enough, there are several additional steps:

  1. Compare your results (average word length, etc.) and throw out any large outliers.
  2. Write a formatter script per site to do cleanup, using single or multi-threading, to replace what you don't need.
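The first step could be sketched as below, under loudly stated assumptions: the answer does not say how to measure an outlier, so comparing each candidate's average word length against the median, and the cutoff passed as maxDeviation, are hypothetical choices, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OutlierFilter {
    // Average length of the whitespace-separated words in a candidate.
    static double averageWordLength(String s) {
        String[] words = s.trim().split("\\s+");
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return (double) total / words.length;
    }

    // Median of a non-empty list of values.
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2) : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Keeps candidates whose average word length is within maxDeviation of the median.
    // The median and the cutoff are assumptions, not something the answer prescribes.
    static List<String> dropOutliers(List<String> candidates, double maxDeviation) {
        List<String> kept = new ArrayList<>();
        if (candidates.isEmpty()) {
            return kept;
        }
        List<Double> lengths = new ArrayList<>();
        for (String c : candidates) {
            lengths.add(averageWordLength(c));
        }
        double mid = median(lengths);
        for (int i = 0; i < candidates.size(); i++) {
            if (Math.abs(lengths.get(i) - mid) <= maxDeviation) {
                kept.add(candidates.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical candidates; the last one is page junk that happened to start with digits.
        System.out.println(dropOutliers(List.of(
            "123 Main St", "45 Oak Ave", "2024 CopyrightAllRightsReservedTermsOfUse"), 3.0));
    }
}
```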

You will probably need to strip out HTML as well. Run this regex in a replace statement to do that:

<.*?>
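In Java, that replace statement might look like the sketch below. Replacing each tag with a space and then collapsing whitespace is an added assumption, so that text from neighboring elements does not run together. The class name and sample markup are hypothetical. Note that `.` does not match line breaks by default, so a tag split across lines would be left in place unless the pattern is prefixed with `(?s)`.

```java
final class StripTags {
    // Replaces every tag with a space, collapses runs of whitespace, and trims the ends.
    static String strip(String html) {
        return html.replaceAll("<.*?>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup with an address split across several tags.
        System.out.println(strip("<p>123 <b>Main</b> St</p><p>Springfield, IL</p>"));
    }
}
```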

If you have trouble, use something like my regex tester (the website, not my own) to build your regex.

Having worked on this problem quite extensively at SmartyStreets, I will tell you "NO" to parsing/finding street addresses with a regex.

Addresses are not a regular language and cannot be matched by a regular expression.

To solve the problem, we developed an API which actually finds and extracts addresses, with notably high accuracy. It's free for low-volume use. (It was not an easy problem to solve.) You can try it for free on the homepage demo. And no, this is not a solicitation. If you want to learn more about street addresses in any amount of detail, from very basic to very technical, just email us, because we want to educate the community about addresses.

To extract addresses, there are regular expressions under the hood, but results are biased strongly toward those which actually verify, meaning those which actually exist. In other words, this is a parser performing complex operations to find and match addresses.

This answer to a very similar question is related, and you may find it useful. The other answers highlight some important points about the difficulties and solutions for parsing street addresses...