How do I extract data from a web page with regexes?
I am writing a curl script to collect information about some sex offenders. I have developed a script that picks up links like the one below:
http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)
Now, when I follow this link, I want to get the information under all the fields on the page, such as Offender Id:, last name, etc., into my own variables. I am very weak at regex, which is why I am here. Or is there another way?
Can anybody help me do that?
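You don't want to use a regex (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). Look for an HTML parser for PHP instead; see the answers to "Can you provide an example of parsing HTML with your favorite parser?"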
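To make the parser suggestion concrete, here is a rough sketch of the idea. It is written in Python with the standard-library `html.parser` module rather than PHP, purely for illustration; a PHP HTML parser would follow the same pattern. The markup in `SAMPLE_HTML` is hypothetical, since the real page layout is not shown in the question, and `FieldParser` and `extract_fields` are made-up names.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page layout is not shown in the question.
SAMPLE_HTML = """
<table>
  <tr><td>Offender Id:</td><td>12345</td></tr>
  <tr><td>Last Name:</td><td>DOE</td></tr>
</table>
"""


class FieldParser(HTMLParser):
    """Collects the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


def extract_fields(html):
    """Return a dict mapping each label cell to the cell that follows it."""
    parser = FieldParser()
    parser.feed(html)
    cells = [c.strip() for c in parser.cells]
    fields = {}
    # Assumption: a cell ending in ":" is a label and the next cell is its value.
    for label, value in zip(cells, cells[1:]):
        if label.endswith(":"):
            fields[label.rstrip(":")] = value
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Because the values are located by table structure rather than by surrounding characters, whitespace or line-break changes in the page should not affect the result, although a change to the table layout itself still would.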
I tend to agree with the previous poster that regex is not the right tool for the job. If you just want a quick and dirty expression, here goes:
Offender Id:.*
.* [0-9]*
NOTE: You must include the newline in this expression. Also note that this is very fragile: it will break if the source you are parsing changes much at all.
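For what it's worth, here is a small sketch of how that expression might be applied, in Python for illustration. The capture group and the `+` quantifier are additions so the number can be pulled out; the two sample fragments are hypothetical, since the real markup is not shown in the thread, and `offender_id` is a made-up helper name.

```python
import re

# The expression from above, with a capture group added around the digits.
# The literal newline in the original is written as \n here.
PATTERN = re.compile(r"Offender Id:.*\n.* ([0-9]+)")


def offender_id(text):
    """Return the captured digits, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical page fragments.
two_lines = "<td>Offender Id:</td>\n<td> 12345</td>"
one_line = "<td>Offender Id:</td><td>12345</td>"

print(offender_id(two_lines))  # 12345
print(offender_id(one_line))   # None
```

The second call shows the fragility: once the label and the value end up on the same line, the expression no longer matches.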