Problem getting all URLs from a page using a regular expression?

Question

I have a webpage's source code stored in $page, and I need to extract all URLs from it.

The problem is that some URLs are not in the href of an <a> tag but inside JavaScript code.

For example, this is source code that I want to extract all URLs from:

    Click <a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">

            Here</a> to go to download page

<a href="http://www.UUUU.com/register">Hi all</a>

I use this regex code:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // Each $match holds the full match followed by the capture groups
        print_r($match);
    }
}

The output shows only

http://www.UUUU.com/register

but the other link

http://www.AAAAA.com/atnbc1i7b/part1.html

does not appear!

Please help.

Thanks.

Answer 1

In the first example you have:

<a href="http://www.UUUU.com/register">

so the regexp works,

but in the second:

<a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">

which does not work because of:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

As you can see, the regexp contains href=, and the second tag has no href attribute.

Change href= to onClick= and try again; this should resolve the problem.

If you need both href and onClick, use (href|onClick).
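As a rough illustration of the (href|onClick) idea: an onClick value holds a JavaScript call rather than a bare URL, so the URL still has to be pulled out of the attribute value in a second step. The sketch below assumes double-quoted attribute values and http(s) URLs only; the variable names and both patterns are illustrative, not taken from the question, and the sample markup is a trimmed version of the one above.

```php
<?php
// Trimmed copy of the sample markup from the question.
$page = <<<'HTML'
Click <a style="color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">
Here</a> to go to download page
<a href="http://www.UUUU.com/register">Hi all</a>
HTML;

// Step 1 (assumed pattern): capture the double-quoted value of href or onclick.
$attrPattern = '/<a\s[^>]*\b(href|onclick)\s*=\s*"([^"]*)"/i';
// Step 2 (assumed pattern): pull an http(s) URL out of that value.
$urlPattern = '~https?://[^\s\'"()<>]+~i';

$found = [];
if (preg_match_all($attrPattern, $page, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] is the raw attribute value, e.g. "return downme('...')"
        if (preg_match($urlPattern, $match[2], $url)) {
            $found[] = $url[0];
        }
    }
}

print_r($found);
```

Note that this picks up at most one attribute per <a> tag, so a tag carrying both href and onClick would need a different approach.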

Answer 2

Instead of matching on the <a href, try matching on the URL itself:

$regexp = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))";

I haven't been able to test this, but if you run your page through it, it should match anything that resembles a URL, whether it is in an href, an onclick, or plain text.

EDIT: found a better regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Answer 3

URL: find in full text (protocol optional). This matches URLs like www.domain.com and ftp.domain.com without the http: or ftp: protocol. The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

$html = <<< EOF
Click <a style="vertical-align:middle;cursor:pointer;text-decoration:underline;color:red;" onClick="return downme('http://www.AAAAA.com/atnbc1i7b/part1.html')">
Here</a> to go to download page
<a href="http://www.UUUU.com/register">Hi all</a>
EOF;

preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i', $html, $urls, PREG_PATTERN_ORDER);
// $urls[0] holds every full match
foreach ($urls[0] as $url) {
    echo $url, "\n"; // one URL per line
}

/* Output:
http://www.AAAAA.com/atnbc1i7b/part1.html
http://www.UUUU.com/register
*/
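
To illustrate the point about trailing punctuation, here is a small sketch that runs the same pattern over plain text. The helper name findUrlsInText and the sample sentence are made up for this example; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper wrapping the pattern from this answer.
function findUrlsInText(string $text): array
{
    $pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i';
    preg_match_all($pattern, $text, $urls);
    return $urls[0]; // index 0 holds the full matches
}

// Sample sentence (made up): a comma and a full stop follow the URLs.
$text = 'See www.domain.com, then ftp.domain.com.';
$result = findUrlsInText($text);
print_r($result);
```

Because the last character of a match must come from the final character class, which contains no comma or full stop, the punctuation after each URL in the sample sentence is left out of the match.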

3 answers

Answer 1
0 2011-08-18 11:45:23

Answer 2
0 2011-08-18 11:50:30

Answer 3
0 2011-08-18 15:28:55
