How can I use regular expressions to extract data from a web page?

Question

I am writing a curl script to collect information about some sex offenders. So far, the script picks up links like the one below:

http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)

When I follow this link, I want to read the value of every field on the page, such as Offender Id, last name, and so on, into my own variables. I am very weak at regex, which is why I am asking here. Or is there another way?

Can anybody help me do that?

Answer 1

phpQuery is very nice for screen-scraping in PHP. It lets you access the DOM using the same methods jQuery has.

Answer 2

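You do not want to use regular expressions for this (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). Look for an HTML parser for PHP instead. See the answers to "Can you provide an example of parsing HTML with your favorite parser?"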
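To illustrate the parser-based approach, here is a minimal sketch. It is written in Python with the standard-library `html.parser` module rather than a PHP parser, purely because the idea is language-independent. The markup in `SAMPLE_HTML`, the assumption that labels and values sit in alternating `<td>` cells, and the names `CellCollector` and `extract_fields` are all hypothetical; the real page layout is not shown in the question.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page layout is not shown in the question.
SAMPLE_HTML = """
<table>
  <tr><td>Offender Id:</td><td>&nbsp;12345</td></tr>
  <tr><td>Last Name:</td><td>&nbsp;DOE</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()  # character references such as &nbsp; are decoded by default
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self.cells[-1] += data


def extract_fields(html):
    """Return a dict mapping each label cell to the value cell that follows it."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    # Normalise the non-breaking space produced by &nbsp; and trim whitespace.
    cells = [c.replace("\xa0", " ").strip() for c in parser.cells]
    # Assumption: cells alternate label, value, label, value, ...
    return {label.rstrip(":"): value for label, value in zip(cells[0::2], cells[1::2])}


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Because the parser works on the document structure rather than on raw text, changes in whitespace or line breaks in the page source do not affect the result, though a change in the table layout itself still would.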

Answer 3

I tend to agree with the previous poster that regex is not the right tool for the job. If you just want a quick and dirty expression, here goes:

Offender Id:.*
.*&nbsp;[0-9]*

NOTE: You must include the newline in this expression. Also note that this is very fragile: it will break if the source you are parsing changes much at all.
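As a rough sketch of how such an expression could be applied, the Python snippet below runs it against a made-up two-line fragment. The fragment in `SAMPLE_HTML` and the helper name `extract_offender_id` are hypothetical, and the sketch assumes the page source contains a literal `&nbsp;` entity before the number. Two small changes are made for illustration: a capture group is added around the digits, and `[0-9]*` becomes `[0-9]+` so that an empty match is not accepted.

```python
import re

# Hypothetical two-line fragment; the real page markup is not shown in the question.
SAMPLE_HTML = (
    "<td>Offender Id:</td>\n"
    "<td>&nbsp;12345</td>\n"
)

# The quick and dirty expression, with a capture group around the digits.
# "." does not match a newline by default, so the "\n" has to be written explicitly.
OFFENDER_ID = re.compile(r"Offender Id:.*\n.*&nbsp;([0-9]+)")


def extract_offender_id(html):
    """Return the digits following 'Offender Id:' on the next line, or None."""
    match = OFFENDER_ID.search(html)
    return match.group(1) if match else None


offender_id = extract_offender_id(SAMPLE_HTML)
print(offender_id)

# The same cells on a single line no longer match, which shows the fragility.
print(extract_offender_id("<td>Offender Id:</td><td>&nbsp;12345</td>"))
```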

3 solutions

Solution 1
4 2009-04-30 21:50:51

Solution 2
1 2009-04-30 21:46:23

Solution 3
0
